Preface to the First Russian Edition


For a long time it so happened that almost no information on the scientific 
research carried out in the field of mathematical theory percolated beyond the 
realm of a restricted circle of professional mathematicians. This circumstance 
sometimes even led non-specialists to an entirely incorrect notion of the absolute
completeness of mathematics, envisaging new research in this field to be almost
impossible or, in any case, extremely tedious. The reason for this situation is
explained by the fact that an overwhelming majority of recent works published 
in mathematical journals are related to sufficiently developed branches of this 
science which are incomprehensible to a person having no special training. As 
regards more elementary areas of mathematics, like elementary geometry, it is 
difficult to suppose the existence of any facts or theorems of really crucial the-
oretical value that have gone unnoticed by several generations of workers in this
area.† Also the new significant directions that have emerged in pure and applied
mathematics during the recent decades, as a rule, are confined to sufficiently 
complex concepts and ideas offering little scope for their popularization. Viewed 
in this context, all the more credit is due to C. E. Shannon, the well-known
American applied mathematician, who in 1947-1948 inaugurated an important new
domain of mathematics that stemmed from quite elementary considerations.

The basic problems confronted by Shannon in the initiation of the discipline
which was later designated as information theory were connected with engineering
questions related to electrical and radio communications.‡ Generally speaking,
newly emerging applications of mathematics in engineering and natural
sciences are usually closely related to the use of complex mathematical notions
and methods. Hence quite often they are also not susceptible to elucidation
without a deep insight into the intricate problems of modern science and
technology. This circumstance has severely restricted the opportunities for
popularization of recent practical achievements of mathematics. Hence, the idea
of a non-specialist about the importance of applied mathematics often remains
confined to what he learned in school courses: that geometry was used in ancient
Egypt for reestablishment of land boundaries after floods in the Nile, and a few
similar facts. In this respect, the exposition of a string of ideas related to
information theory represents an extremely alluring theme for popularization,
since the simplest practical applications of these ideas to modern engineering
problems can be explained fully even to readers who have a minimum mathematical
and engineering background.

†However, as a matter of fact, even in these elementary areas of mathematics some serious
questions still remain open. Therefore, it is not surprising that sometimes stimulating and
fundamental works appear related, e.g., to elementary geometry. See, for instance, W.G.
Boltyanskii’s Equivalent and Equidecomposable Figures, Fizmatgiz, Moscow, 1956 (English
translation published by D.C. Heath, Boston, 1963), based mainly on quite recent results from
elementary geometry.

‡Owing to the general character of Shannon’s work, it exerted a great stimulating effect on
the entire research related to the transmission and preservation of any information met with
in nature and technology. The channels through which this information is transmitted may be
not only the telegraphic and telephonic wires or media transmitting radio-signals but also the
nerves through which signals from organs of sense are transmitted to muscles via the brain, or
those yet almost completely unexplored paths by which the indications of the future structural
plan of a living organism are transmitted from an embryonic cell.

The present book, designed for a wide circle of readers (familiarity with math-
ematics up to high school level suffices for comprehension of all of its contents),
makes, of course, no claim to serve even as an elementary introduction to the 
scientific information theory. We can give here only a preliminary idea of import- 
ant practical applications of this theory. Similarly, it shall not be possible to 
deal here with the deeper purely mathematical problems connected with the in- 
formation theory. The main aim of the authors is much simpler : it consists of 
acquainting the reader with certain, though not complex but highly important, 
new mathematical ideas, and leading him through these ideas to an understand-
ing of one of the possible means of employing mathematical methods in modern
engineering.

The first chapter of the book is devoted to the exposition of the classical (in-
troduced as early as the seventeenth century) concept of probability, acquaintance
with which is necessary for a comprehension of all the subject matter that fol-
lows. In the second chapter, the recent concepts of entropy and information due
to Shannon are considered, whose general scientific value has been evaluated by 
the mathematicians only during the last few years. The third and fourth chap- 
ters present examples and applications. In contrast to the preceding two chap- 
ters, rigorous proofs of statements made here are often just outlined or com- 
pletely omitted, and in certain cases such statements are even formulated only 
in the form of highly plausible propositions. Furthermore, in the third chapter 
we have demonstrated the usefulness of the concepts of entropy and information 
by recreative problems on guessing numbers, counterfeit coins and so on (these 
problems are in a sense similar to problems on playing cards and dice that led
to the emergence of probability theory in the seventeenth century). The engine- 
ering applications to communication theory that are richer in content are discus- 
sed in the fourth chapter. We expect the reader’s acquaintance with the re- 
creative third chapter to enable him to develop a better grasp of the meaning 
of basic concepts introduced in Chapter 2, and by the same token to prepare 
himself for a study of Chapter 4, which is the most complex part of the book 
and also uses some results of the third chapter.

Though the book is designed for all lovers of mathematics, it is meant pri- 
marily for high school and undergraduate college students and teachers. To- 
gether with them, it must also be of interest to many readers who have special- 
ized in communication engineering but do not possess a sound mathematical 
background. The book is based on a lecture delivered by one of the authors to a 
group of high school participants of a special mathematics study group at 
Moscow State University. The contents of this lecture have, however, been 
expanded considerably here. 

The authors express their sincere gratitude to A. N. Kolmogorov, whose valu- 
able suggestions contributed to an appreciable improvement of the book. They 
are also thankful to M. M. Goryachaya, the editor of the book, whose remarks 
helped in correcting certain deficiencies of the primary exposition. 


A. M. YAGLOM 
Moscow, May, 1956 I. M. YAGLOM 


Preface to the Second Russian Edition 


The second edition of the book Probability and Information does not differ in 
structure substantially from the first edition. A comparison of the table of 
contents of both the editions of the book will make it clear that the structural 
variations between the two are quite insignificant. The character of the book has 
also not been changed, assuming of the reader a quite modest mathematical
knowledge (which deficiency must, however, be counterbalanced with certain 
persistence). Nevertheless, there are specific distinctions between the two edi- 
tions which are so significant that we may now speak of it as being a new book 
rather than a revised edition. 

Such crucial changes have partially stemmed from the fact that this book deals 
with a very young and rapidly developing branch of science, for which an inter- 
val of two years between the first and the second editions constitutes a note- 
worthy gap. The authors tried to keep themselves abreast with the developments 
that took place during these two years. This was accomplished by them to a 
great extent by looking over numerous new books and papers, since the litera-
ture on information theory has proliferated during this period with stupendous
intensity. However, it is the one omission due to the authors which singularly 
necessitated the revision of the first edition. 

The present book has grown from a lecture delivered to a group of Moscow 
high school students interested in mathematics. The authors firmly bear in mind 
this genesis of the book, to which the readers obviously pay little attention. 
Accordingly, in the Preface to the first edition of the book it was stated that it 
is designed for all lovers of mathematics and primarily for high school teachers 
and students. In this connection, we, however, overlooked one more category 
of numerous readers, consisting of people who are seriously interested in the 
information theory (and not in mathematics in general), but do not desire to 
embark upon its study through specialized literature, whose thorough grasp in- 
volves both time and efforts. The book drew the greatest appreciation from the 
professional mathematicians and communication engineers and our remonstranc- 
es that it was not intended for readers from either of these categories failed 
to produce any effect. The authors were taken by surprise by the swiftness with 
which the first edition of the book disappeared from the market and was trans-
lated into several foreign languages (e.g., Hungarian, German, French and
Japanese). Such overwhelming response forced us to concede that the book 
does meet some vital needs and prompted us to focus our attention on how this 
requirement could be served more adequately. 
We are now also inclined to consider that our book is unsuitable for the read-
ers who are interested in the sophisticated topics of the mathematical informa-
tion theory or of communication engineering. For the former class of readers,
it is natural to recommend Feinstein [9]†, a comparatively concise but terse
book. For readers of the second category, Woodward [24] is obviously a quite 
suitable and fascinating work. Also, the physicists or biologists, who are in- 
terested in Shannon’s ideas, would not naturally turn to our book but to Bril- 
louin [5] (for physicists) and Ashby [3] (for biologists). However, it could pos- 
sibly be profitable even for many readers from all such categories to acquaint 
themselves with the present elementary book as a starting point. It is only for 
the philologists, who currently represent a sufficiently significant group of ‘users’ 
of information theory that we had nothing to suggest; this led us to devote 
greater attention to the problems encountered by them in the second edition of 
our book. And, if during the preparation of the new edition we have rejected 
as before any material whose inclusion could raise the mathematical level beyond 
what is required for the reading of the first edition, then we have also kept in 
view this time not only the school students, but also the biologists or philolo- 
gists who are not familiar with calculus. 

This requirement to cater for a wider circle of readers of the book necessitat- 
ed a series of essential changes in the text. Thus, for example, in the new edition 
the capital Russian letters Э (entropy) and И (information) have been removed. In
fact, these unusual notations could have facilitated the reading of the book by 
completely inexperienced readers, but at the same time they caused inconveni- 
ence to all those who had (or desired in future to have) to do also with other 
literature on information theory, using different notations. It was also natural 
that in Chapter 2 we paid adequate attention to the statistical interpretation of 
the concept of entropy, making it quite fruitful for all practical applications of 
information theory. We have considerably expanded the last chapter that has 
the greatest applied value; the volume of the book has also been enlarged with 
the addition of the supplementary material printed in small type (that may 
even be skipped in the first reading). In particular, taking account of the inter- 
est of mathematicians, we have derived in these supplements rigorous proofs of 
certain premises that have been merely propounded in the basic text. The 
character of the problems in the book has also been changed; in the present 
edition exercises on the urn scheme and mathematical recreations occur less
often, while the practical problems of applied information theory are
more frequent. However, we have preserved the entire chapter devoted espe- 
cially to the elementary problems on quick wits, since it is essentially through 
these problems in a new (and sufficiently attractive) form that we have tackled 
pretty serious questions of the most economic message transmission. This rela-
tionship, which we found to have been missed by some readers of the first edi-
tion, has now been more prominently highlighted.

†The digits in square brackets indicate the number under which the reference is
listed at the end of the book.

The present edition of the book is supplemented with a bibliography which the 
first edition had lacked. Being convinced (in particular by the experience gained 
during our work on this book) of the computational convenience that is conferr-
ed by the table of values of the function -p log p (where 0 < p < 1), we have
included such a table as Appendix III in the book. The binary system of log-
arithms has been retained in this table; in the text, however, we have employed
decimal logarithms, which have a wider acceptance from the majority of readers 
(especially because we desired to demolish the notion held by several engineers 
that the use of binary logarithms forms precisely the basis of information theory). 
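
As an illustration only, a table of this kind can be sketched in a few lines of Python. The function names neg_p_log_p and build_table and the step of 0.1 are hypothetical choices made here, not the layout of Appendix III. The sketch also shows that a decimal-logarithm entry differs from the binary one only by the constant factor log10(2), so the choice of base is a choice of unit.

```python
import math


def neg_p_log_p(p: float, base: float = 2.0) -> float:
    """Return -p * log_base(p), taking the value 0 at p = 0 by convention."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in the interval [0, 1]")
    if p in (0.0, 1.0):
        # log 1 = 0, and p * log p tends to 0 as p tends to 0
        return 0.0
    return -p * math.log(p) / math.log(base)


def build_table(steps: int = 10, base: float = 2.0) -> list[tuple[float, float]]:
    """Tabulate -p log p at p = 0, 1/steps, 2/steps, ..., 1."""
    return [(k / steps, neg_p_log_p(k / steps, base)) for k in range(steps + 1)]


if __name__ == "__main__":
    for p, binary_value in build_table():
        # A decimal-logarithm entry is the binary entry times log10(2).
        decimal_value = binary_value * math.log10(2)
        print(f"{p:4.2f}  {binary_value:8.5f}  {decimal_value:8.5f}")
```

Running the script prints three columns: p, the binary-logarithm value and the decimal-logarithm value.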

In conclusion, the most significant change is the addition of a special Section 
4.3, which gives a resume of the data on information contained in various specific 
types of messages (written and spoken language, music, television and photo- 
telegraphic images). At the end of this section we have also briefly cited some 
data on the capacity of different communication channels. This is the largest 
section in the book; it is not used directly in the following text and can be skip- 
ped completely by a reader who is interested in only the mathematical side of 
information theory. To us, however, it appears that the number of those readers 
will be considerably large for whom this section proves to be of highest appeal. 
Section 4.3 is somewhat of a distinctive character from the rest of the book— 
factually, it presents a review of a large number of comparatively more special- 
ized papers that have appeared recently in different scientific and engineering 
journals. For the convenience of readers who are interested in some specific 
field of the applications of information theory, we have indicated in all cases 
the exact source that contains a more elaborate exposition of the results men- 
tioned by us (the major portion of the bibliography appended to the book is 
related to this section). It has also been our endeavour to make our review as 
complete as possible (to the extent to which it could be possible without violat- 
ing the elementary character of the book). However, it has been necessary 
to bear in mind that owing to the intensity with which the study of statistical 
properties of messages and communication channels is being pursued all over 
the world during present times, it is apprehended that the review Section 4.3 
may become deficient by the time the book appears and a few years later the 
data presented in it may become substantially outdated. However, we feel that 
even then Section 4.3 shall not lose its utility. In fact, the basic objective of this 
section is to give an idea of the order of magnitudes of the amount of informa- 
tion met with in science and technology, and to illustrate the general directions 
in which engineering, philological and biological studies have been inspired by 
information theory, but not to provide at all a base for the scientific research 
work of specialists. 

Finally, we wish to thank sincerely all our readers who communicated their
comments to us, which assisted us in the preparation of the new improved
edition. In particular, we wish to thank S. G. Gindikin, A. N. Kolmogorov,
V. I. Levenstein, P. S. Novikov, I. A. Ovseevich, S. M. Rytov, V. A. Uspenski, 
G. A. Shestopal, M. I. Eidelnant and especially R. L. Dobrushin and A. A. 
Kharkevich. We are also grateful to V. A. Garmash, L. R. Zinder, D. S. Lebedev 
and T. N. Moloshnaya for fruitful discussions we had with them on the problems 
connected with the contents of Section 4.3 of the present book. 


A. M. YAGLOM 
Moscow, March 1959 I. M. YAGLOM 


Preface to the Third Russian Edition 


The first edition of the book was published in 1957 and the second one in 1960. 
However, there is a passage of thirteen years between the second and third edi- 
tions. We ourselves must apologize for such a considerable gap between the last 
two editions. Though the second edition of the book was long back reduced to 
the status of a mere bibliographic rarity leading to a spate of enquiries from the 
readers and repeated overtures from the publishing house for its revision, we 
could not somehow make up our mind. It was clear to us that it was impos- 
sible to keep the book in the form it had in the second edition, because it was 
necessary to incorporate in it the substantial changes that had taken place dur- 
ing these years in information theory. Such thorough revision of the book (ac- 
companied with the alteration of even its title as suggested by many) obviously 
presented a highly laborious and involved task, which was perhaps beyond our 
stamina. 

We eventually took recourse to the way of compromise, which is almost al- 
ways chosen by the people placed in an inconvenient situation. The present 
third edition of the book retains the earlier title and much of its original look. 
Thus, for example, we do not assume of the reader, as before, the background 
beyond the level of high school mathematics. The book accordingly still re- 
mains simpler than all the other existing text books and monographs giving an 
exposition of information theory. At the same time, we could not also ignore 
the circumstance that, to our surprise, the second edition of Probability and In- 
formation was used both within and outside our country in a series of cases as 
the basic textbook for delivering lecture courses in colleges and universities. 
Hence during the revision and enlargement of the text we had the added im- 
petus to make the book more suitable for such use, earlier not foreseen by us. In 
particular, we have refrained from using in the text the common decimal log- 
arithms and the uncommon decimal units for the measurement of the amount 
of information (dits) which thereby eliminated the last shred of direct evidence 
of this book having grown out of a lecture delivered to school students many
years ago.†

†In the literature addressed to school students, the use of binary logarithms creates some
impression of artificiality. However, in a book on information theory designed for more
mature readers, such an impression is, on the contrary, liable to be produced by the
employment of decimal logarithms in place of the universally used binary ones.

The last Chapter 4, which is also the most important chapter in the book, has
undergone the maximum revision, since Chapters 1-3 actually represent only
an introduction to the basic content material of the book that has been brought 
into focus in Chapter 4. Keeping in view the readers who desired to be acquaint- 
ed through this book with the mathematical fundamentals of information theory, 
we have included in Section 4.2 an exposition of optimal Huffman codes (the-
oretically more important than the Shannon-Fano code, which was also considered
in the previous editions) and substantially sharpened the proof of the fundamental
noiseless coding theorem, making it more compact and mathematically precise.
Section 4.4 has been modified still more extensively; there we have given, in
particular, two new proofs of the fundamental noisy coding theorem together with
a simple proof of the converse to the coding theorem. The same purpose is also served by
the inclusion in the first chapter of the law of large numbers, which permits us to
make later some more rigorous deductions, and also by an appreciable increase 
in the number of references from serious scientific literature, to the study of 
which this book provides a natural bridge. 

However, the most crucial circumstance we had to take into consideration in 
the preparation of the revised edition of our book is that during the last two 
decades even the frontiers of information theory underwent a substantial change. 
At present, the most important part of information theory is indisputably
coding theory, whose rapid development could not have been forecast
at the time the earlier editions were written. Hence, today even a popular work
on information theory will be irrelevant if it completely ignores that branch of
the subject which attracts the greatest interest of both theoreticians and practical
engineers and engages the lion’s share of the efforts of specialists in information
theory throughout the world. On the other hand, the general character of coding
theory and the mathematical tools and methods applicable to this important and
elegant field of applied mathematics differ quite substantially from the basic
contents of our book. Reorienting the book toward coding
theory would have involved writing it afresh. Hence, here also we have
kept to the middle of the road: we have added to Chap. 4 a completely new
concluding section to provide just an introduction to the tasks and techniques of
coding theory; as a matter of fact, even in its present form this section is
appreciably out of tune with the rest of the contents of the book. This gap
motivated us to add to the book a new Appendix II devoted to certain purely
algebraic concepts and propositions; however, as a compensatory feature we have
omitted Appendix II of the second edition as it had become superfluous after the 
revision of the main text. Strictly speaking, the new Appendix II is not prere- 
quisite for following the content matter of Section 4.5 devoted to the coding 
theory; however, an overview of this appendix before taking up the indicated 
section will obviously enable the reader to have a greater insight into the potent- 
ialities of further development and extension of the results of this section. 

A singular place is occupied in this book by Section 4.3, about which we have
spoken at sufficient length in the Preface to the second edition. This section contains
a review of the data on various specific types of messages which, as far as is known
to us, is a unique resume of this kind in the literature; the latter circumstance
also motivated us to enlarge this section further by including in it a review of
the majority of more recent works. It is obvious that in spite of the extensive
expansion of the reference list, we cannot pretend to have covered all
of the printed literature on the topics considered. It is quite possible that works 
scattered over a vast number of journals in highly diverse stray fields might have 
escaped our notice. We must also caution the reader that we have not concerned 
ourselves with the verification of numerical data available in various investiga- 
tions and an analysis of the extent of their statistical reliability. It seems that 
much work still remains to be done in the latter direction. However, despite 
the fact that not all of the data adduced in Section 4.3 is completely reliable, 
its inclusion in the book is justified, for it enables the reader to get here a suffic- 
iently complete idea of the results achieved so far in a number of specific fields 
of information theory and of the general directions of major researches in these 
fields. 

Of course, many aspects related to information theory have not been touched 
upon in our book. Apart from the natural infeasibility ‘to envelop the bound- 
less’, the limitation is partly set upon by our proneness to retain in the present 
edition the look this book earlier had. Thus, for example, we have as before 
almost completely ignored in it the problems connected with the estimation of 
entropy and information of experiments with an infinite set of possible outcomes 
(as regards the general concepts and definitions involved here see, for example, 
[12]). We do not also concern ourselves with the so-called ‘algorithmic’ approach 
to the concept of the amount of information (for salient works in this direction, 
see, for example, [15] and [27]); moreover, a combinatorial treatment of this con- 
cept is only briefly sketched in Section 4.3. Finally, all attempts at broad inter- 
pretations of the concept of information beyond the framework of Shannon’s 
theory (of the type of ‘semantic information’ or ‘thesaurus’; see, for example,
[4], [13] and [20]) fall beyond the scope of this book (these attempts are of quite
preliminary nature till now). 

As is well known, the main value of a Preface is that it enables the authors to
thank all those who assisted them in their work. A. N. Kolmogorov has been 
kind enough to place at our disposal his remarkable (unpublished) manuscript 
designed to refine substantially Shannon’s guessing method for estimation of the 
entropy of written language which has been discussed at length in this book. 
Some additional material related to the entropy of a language has also been 
contributed by A. V. Prokhorov. We must also mention that V. V. Ivanov, 
I. A. Ovseevich, N. V. Petrova, B. S. Tsybakov and W. Endres brought to our 
notice some literary sources, which we used to enlarge Section 4.3. The contents 
at a number of places in the book bear the stamp of numerous discussions we 
had with R. L. Dobrushin on the topics from information theory. S. Z. Stam- 
bler, the editor of the third edition, carefully read the entire text and contributed 
to its further improvement. He also supplied us with a long list of additional
references that we have used during the preparation of our book. We express 
our sincere gratitude to all the persons mentioned here. 


A. M. YAGLOM 
Moscow, March 1972 I. M. YAGLOM 


Preface to the English Edition 


The panoramic history of this book is described in the prefaces of its Russian 
editions. It has followed a chequered and unusual course of develop-
ment; to start with it was a small elementary book for teenagers based on a
lecture delivered by one of us 23 years ago to a group of Moscow high school 
students. The primary aim of the book was to expose the relationship of cer- 
tain mathematical recreation exercises with rather serious and very interesting 
mathematical methods developed recently in engineering sciences in order to 
stimulate young readers’ interest in modern mathematics. Later, however, the 
book began to live independently of our wishes. We received a lot of letters 
and comments from our readers and almost all of them turned out to be grown- 
ups having no leisure for recreations but seriously interested in the information 
theory. Therefore, we changed considerably the scope of our book in the second
and third editions in an effort to meet the demands of the new (and, as we dis- 
covered, the predominant) category of our readers. As a result, the book deve- 
loped into a thick volume intended for a wide community of people interested in 
various applications of the modern information theory, but having no special 
mathematical background (in fact, even the requirement of the knowledge of 
elementary differential calculus is dispensed with in our book). 

The book scored a remarkable success in other countries also, and this was 
obviously caused by a widespread interest in the ideas of information theory all 
over the world. The book was translated into at least 10 foreign languages and 
some of the translations underwent several editions which differed from each 
other (and also from all the corresponding Russian editions since, wherever 
possible, we tried to send to the publishers some supplementary material). How- 
ever, for a long time the opportunity of the publication of English translation 
kept on eluding us. We twice received letters from publishers of repute (one
in the U.S.A. and the other in U.K.), seeking our permission to publish the 
English edition of the book. In both the cases, we gave the permission and 
even sent some corrections and supplements. It seems to us that on both the 
occasions the translation work was started but then some technical difficulties 
thwarted the completion of the work. Therefore, we are happy that Hindustan 
Publishing Corporation have finally published the English translation of our 
book and thus made it accessible to a wide circle of new readers. 

The English edition differs from all the previous ones. Besides the minor 
corrections and improvements, we have completely revised (and considerably 
extended) Section 4.3, for it is clear that the discussion of the amount of inform-
ation contained in the spoken and written text messages must now be based
on the data related to the English (and not Russian) language. We have also 
enlarged the concluding Section 4.5 by supplementing it with a description of 
the method of constructing the practically important Bose-Chaudhuri-Hocquen- 
ghem error-correcting codes. This necessitated the inclusion of some addition- 
al material in Appendix II at the end of the book, since the role of this append- 
ix was widened further in comparison to the Russian edition. In order to 
update the book, Section 4.3 has been further reinforced by inclusion of the des- 
cription of some latest works though, of course, it is not possible to claim that 
we have covered all recent papers, which are too numerous to cater for. We 
have also added a new Appendix IV which contains a short table of the func-
tion h(p) = -p log p - (1 - p) log (1 - p), keeping in view the usefulness
of such a table for educational purposes, which is one of the avowed objectives
of the book. 
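
A short table of h(p) can be reproduced with the following minimal Python sketch. It assumes binary logarithms, so that values are in bits; the function name binary_entropy and the step of 0.1 are illustrative choices made here and do not reflect the layout of Appendix IV.

```python
import math


def binary_entropy(p: float) -> float:
    """Return h(p) = -p log2 p - (1 - p) log2 (1 - p), with h(0) = h(1) = 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in the interval [0, 1]")
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            # A term with q = 0 contributes nothing (q log q tends to 0).
            total -= q * math.log2(q)
    return total


if __name__ == "__main__":
    for k in range(11):
        p = k / 10
        print(f"{p:4.2f}  {binary_entropy(p):7.5f}")
```

The function is symmetric about p = 1/2, where it reaches its largest value of 1 bit, so a printed table need only cover 0 <= p <= 1/2.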

We are glad to express here our appreciation to Hindustan Publishing Corp- 
oration for production of the book and to Drs. B. Mandelbrot, T. M. Cover 
and T. Nemetz who have sent us some new material used in the preparation of 
the present edition. 


Moscow and Yaroslavl A. M. YAGLOM 
June, 1983 I. M. YAGLOM 


1 


Probability 


1.1. Definition of probability. Random events and random variables 


In practice, we quite frequently encounter experiments (variously, trials, obser-
vations, processes) which yield different results depending on circumstances
that are unknown or unaccounted for. Thus, for example, when throwing a die
(a homogeneous cube with its faces numbered from 1 to 6), we cannot know in
advance what face will turn up, since this depends on many unknown factors 
(details of hand movement resulting in throwing, die position at the instant of 
roll, peculiarities of the underlying surface, and so on). It is equally impossi- 
ble to forecast beforehand the number of secondary school graduates that will 
enter a given college during a specific year, the number of defective items pro- 
duced by a factory on a given day, or the number of rainy days that will occur 
next year. Similarly, there is no way of predicting the number of errors that 
will be committed by a school student in the homework, or the ticket number 
that will draw the first prize in a prospective lottery draw (the number of winn- 
ing tickets are determined by drawing from a well-shufMled lot of numbered 
tickets in a container), and so on. The number of similar examples can obvious- 
ly be augmented considerably. 

The application of mathematics to a study of such phenomena is based on the
following fact. In many cases when the same experiment is repeated many times
under identical conditions the frequency of occurrence of the result under
consideration (i.e., the ratio of the number of occurrences of this result to the total
number of trials) always remains approximately the same, close to some constant
number p. For example, it is known that the frequency with which a gun
will hit a target under a given set of shooting conditions, as a rule, always
remains almost the same and seldom deviates significantly from a certain average
number (with the passage of time, this average number may of course vary; in
such cases we say that the marksman is improving or, conversely, worsening
his performance). Also the frequency with which a six shows up on the die
or the percentage of defective items under a given set of conditions usually
deviates little when the related ‘trials’ (throw of a die or manufacture of a given item)
are repeated on a mass scale. Proceeding from this, we conclude that in each
case there exists a definite constant number which objectively characterizes the
very process of shooting, die rolling, production of items, and so on. About this
constant the average frequency for the corresponding outcome (hits of a target,
appearance of a six, emergence of defective items) fluctuates all the time (but
does not deviate from it significantly) in the given series of ‘trials.’ The
corresponding constant number is called the probability of the event under investigation.
Probability is defined similarly in a series of other problems related to such
widely divergent fields as mathematics, mechanics, physics, engineering, economics
and biology. The discipline that studies the properties of probability and
various applications of this concept is called probability theory.
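
The stabilization of frequencies described above can be observed in a small computer experiment. The following Python sketch is only an illustration: it assumes that a pseudo-random number generator is an acceptable stand-in for a fair die, and the function name `relative_frequency` and the fixed seed are choices made here, not part of the text.

```python
import random


def relative_frequency(trials, seed=0):
    """Return the fraction of `trials` simulated die rolls that show a six."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes runs repeatable
    sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return sixes / trials


if __name__ == "__main__":
    # The frequency should settle near 1/6 as the number of trials grows.
    for n in (100, 10_000, 100_000):
        print(n, relative_frequency(n))
```

For short series the frequency may still differ noticeably from 1/6; only for long series does it stay close to that constant.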

According to the discussion above, the probability of some event can be 
evaluated approximately from the outcomes of a long series of trials. However, 
obviously the very existence of a probability does not depend ultimately upon 
whether an experiment is performed or not. This raises a most natural question 
concerning the methods by which one can compute the probabilities of various 
events without first carrying out the corresponding experiments; by applying such 
methods we can make, beforehand, a forecast about the outcome of a succeed- 
ing trial, thus opening up great opportunities for the practical scientific applica- 
tions of the concept of probability. We shall not undertake here a detailed 
discussion of this question, but shall confine ourselves only to a very simple 
example from which, however, there can be derived a comparatively wide range 
of problems concerning the evaluation of probability. 

Suppose that we have a box (or, as it is often said, an urn) containing 10
well-mixed balls distinguishable from each other only by colour. Of these 5 are
white, 3 black, and 2 red. We draw a ball from the urn without looking at it;
the question is: What is the probability that this drawing will produce a ball of
a specific colour? It is perfectly clear that here the chances are that out of 10
drawings 5 will produce a white ball, 3 a black ball, and 2 a red one; in other
words, the probability of drawing a white, black, or red ball is, respectively,
5/10 = 1/2, 3/10, and 2/10 = 1/5. Also, indeed, if we repeat this particular experiment
many times (every time returning the ball drawn to the urn and mixing all the
balls well), we become convinced that, of all the drawings, roughly 50% result in
a white ball, 30% in a black ball, and 20% in a red one. Naturally, the problem
of determining the probabilities for any other configuration of balls of diverse
colours, well-mixed and contained in an urn, is also solved in the same
straightforward manner.

Let us consider a few more problems of the determination of probability, 
which reduce to the ‘urn model.’ 


The book by B. V. Gnedenko and A. Ya. Khinchin [31] is recommended to the reader
desirous of a more thorough study of probability theory and its applications. A much
bigger but quite readable book by F. Mosteller, R. E. K. Rourke and G. B. Thomas [38] is also
highly suitable for a first acquaintance with probability theory. See also the slightly more
difficult articles by A. N. Kolmogorov [35] and M. Kac [33] and other related references at
the end of this book.

Problem 1. In flipping a coin at random, what is the probability that a ‘head’
will show up? 

This problem is obviously equivalent to the scheme of placing two balls in an 
urn, of which one is marked ‘head’ and the other ‘tail’ (of course, instead of 
inscribed balls, one can consider balls of two different colours, for example, a 
white and a black). What is the probability that a random drawing of a ball 
from the urn will produce a ball inscribed ‘head’? It is clear that the desired 
probability here is 1/2.


Problem 2. In rolling a die at random, what is the probability of getting an 
integer divisible by 3? 

Instead of rolling a die, we may speak of drawing a ball from an urn contain- 
ing six balls numbered 1, 2, 3, 4, 5, and 6. Now if the third and sixth balls 
are coloured black and the rest are left white, we arrive at the problem of 
determining the probability of drawing a black ball (the numbers 3 and 6 are 
divisible by 3, but the others are not). It is evident that the desired probability 
here is 2/6 = 1/3.


Problem 3. The gathering at a students’ evening is known to consist of twenty 
students from the first college, twenty-five from the second, and thirty from the 
third college. What is the probability that the student with whom you randomly
talked studies at the second college?


This problem obviously corresponds to the scheme of an urn containing 75
balls, of which 20 are white, 25 are black, and 30 are red. What is the
probability that when a ball is drawn randomly from the urn, it will be a black one?
Clearly, this probability is 25/75 = 1/3.

We now proceed to grasp the general principles of solving all these problems.
In the urn scheme that we discussed as a preface to these problems, the
condition that the balls in the urn be well-mixed and drawn blindly implies that we
may, with equal justification, expect the appearance of any of the balls contained
in the urn or, in other words, that the drawing of every ball is equally probable.
But since there are in all 10 balls, it is natural to infer that the probability for
a particular ball to be drawn is 1/10. Further, since there are five white balls,
the probability of drawing a white ball is 5/10 = 1/2.

Exactly the same reasoning leads to the answers to Problems 1-3 above. 
Thus, for instance, in the case of a die roll, we assumed the appearance of any 
of the six faces to be equally probable; it is just this reason that enabled us to 
replace this problem by that of making a drawing from an urn containing six 
balls. However, of the six faces, precisely two are such that their appearance 
satisfies the hypothesis of the problem; the probability of the appearance of 
either of these two faces is 2/6 = 1/3.

If we postulate that the experiments under consideration (drawing of a ball
from an urn, tossing of a coin, rolling of a die, conversation with one of the
participants at a students’ evening, etc.) have n equally probable outcomes, then
it is necessary to regard each of these outcomes as having probability 1/n. We
now consider some event (the drawing of a white ball from an urn, occurrence
of a ‘head’ when a coin is tossed, appearance of an even number when a die is
rolled, a conversation with a student studying in the second college, and so on)
to be determined by the outcomes of an experiment. If this event is realised in m
out of all n equally probable outcomes of an experiment but not in the
remaining n - m outcomes, then the probability of its occurrence is taken as m/n.
In other words, the probability of a certain event is equal to the ratio of the
number of equally probable outcomes favourable to the given event to the total
number of equally probable outcomes. The italicised matter may be taken as the
definition of the concept of probability; further, it must be stipulated in the
description of the experiment to be performed that the distinct outcomes are
equally probable. This objective is precisely served by indicating that the die
has the exact shape of a cube and is made of homogeneous material, or that
the balls are well-mixed and are indistinguishable from each other (except with
regard to colour). Although such a definition does not cover some important
cases of the evaluation of probability (see, for example, papers [33], [35] and
books [29], [30] and [39] as well as Section 5 of this chapter printed in small
type), it is adequate for a majority of the cases considered in this book.
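
The definition of probability as the ratio m/n can be restated as a short computation. The sketch below is only an illustration under the assumption that every listed outcome is equally probable; the helper name `classical_probability` and the way the outcomes are encoded as Python lists are choices made here. It recomputes the urn example and Problems 1-3 with exact fractions.

```python
from fractions import Fraction


def classical_probability(outcomes, is_favourable):
    """Return m/n: favourable outcomes over all equally probable outcomes."""
    outcomes = list(outcomes)
    m = sum(1 for outcome in outcomes if is_favourable(outcome))
    return Fraction(m, len(outcomes))


# Each experiment is modelled as a list of equally probable outcomes.
urn = ["white"] * 5 + ["black"] * 3 + ["red"] * 2
coin = ["head", "tail"]
die = range(1, 7)
students = ["first"] * 20 + ["second"] * 25 + ["third"] * 30

p_white = classical_probability(urn, lambda ball: ball == "white")
p_head = classical_probability(coin, lambda face: face == "head")
p_div3 = classical_probability(die, lambda face: face % 3 == 0)
p_second = classical_probability(students, lambda college: college == "second")
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the values obtained in the text.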

Let us now agree on the terminology which we will need later on. An event 
which may or may not occur as the result of an experiment is called a random 
event; in the same sense we speak of the outcome of a given experiment. We 
shall use capital letters to denote random events and denote by p the probability 
of a random event (or of a specific outcome of an experiment); the probability 
of an event A is often written as p(A). An important role is played by an experiment
that can have several different outcomes; in such a case, we denote all these
outcomes by a single letter with different subscripts (and the experiment itself 
mostly by a Greek letter). 

To each such experiment there corresponds a specific probability table: 


Outcomes of experiment    $A_1$      $A_2$      ...    $A_k$
Probability               $p(A_1)$   $p(A_2)$   ...    $p(A_k)$


Thus, for example, the urn experiment discussed above corresponds to the
table


Outcomes of experiment    $A_1$    $A_2$    $A_3$
Probability               1/2      3/10     1/5


(here, $A_1$ is the drawing of a white ball, $A_2$ of a black ball, and $A_3$ of a red
ball). The experiment considered in Problem 1 is characterized by the simple
table


Outcomes of experiment    $B_1$    $B_2$
Probability               1/2      1/2


(here, $B_1$ and $B_2$ represent, respectively, the ‘head’ and ‘tail’ that can appear).
The rolling of a die gives the following probability table:


Number that appears on the face    1      2      3      4      5      6
Probability                        1/6    1/6    1/6    1/6    1/6    1/6


It is worthwhile to note one salient difference between the last table and its two
predecessors. The outcomes of the last experiment can be expressed by means
of specific numbers (1, 2, 3, 4, 5 and 6), an opportunity not open to us in the
preceding examples. In this case, we can say that the number that appears on a
face when a die is rolled is a random variable which is capable of taking any one
of the six possible values, depending on chance (i.e., depending upon
circumstances that cannot be predicted). Other examples of random variables
are the number of defective items per lot of 100, the number of births in
some town per annum, the number of points scored by some marksman under
prescribed shooting conditions in one round of firing (a target board showing
the number of points that are counted when each of its parts is hit is illustrated
in Fig. 1), etc.†


Fig. 1.

The very term ‘random variable’ indicates that its value may vary from trial to
trial; nevertheless, such a variable can be characterized by a definite mean
value. The way this is to be achieved is not
difficult to conceive. As an example, let us consider the first of the random variables
enumerated above (the number of defective items in a lot of 100); suppose that,
under defined production conditions, this number does not exceed 6, and the
corresponding probabilities take the form of the following table:


Number of defective items    0      1      2      3      4      5      6
Probability                  0.1    0.15   0.2    0.25   0.15   0.1    0.05


In such a case, among a large number N of lots of a hundred items, roughly 0.1N do
not contain a single defective item, 0.15N contain one, 0.2N two, 0.25N three,
0.15N four, 0.1N five and 0.05N six defective items. Consequently, for large
N, the total number a of defective items in these N lots can be regarded as given by


$$a = 0.1N \cdot 0 + 0.15N \cdot 1 + 0.2N \cdot 2 + 0.25N \cdot 3 + 0.15N \cdot 4 + 0.1N \cdot 5 + 0.05N \cdot 6.$$


Hence, the mean value of the number of defective items per hundred (the mean 
percentage of defects) is given here by 


$$a/N = 0.1 \cdot 0 + 0.15 \cdot 1 + 0.2 \cdot 2 + 0.25 \cdot 3 + 0.15 \cdot 4 + 0.1 \cdot 5 + 0.05 \cdot 6 = 2.7.$$


In general, if the probability table for the random variable $\alpha$ has the form


Values of random variable    $a_1$    $a_2$    $a_3$    ...    $a_k$
Probability                  $p_1$    $p_2$    $p_3$    ...    $p_k$


then the mean value of this random variable is defined by

$$\text{M.V.}\,\alpha = p_1 a_1 + p_2 a_2 + p_3 a_3 + \dots + p_k a_k.$$
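
As a check on the arithmetic, the mean value formula can be evaluated directly from a probability table. The following Python sketch is an illustration only: the function name `mean_value` and the representation of the table as a dictionary mapping values to probabilities are assumptions made here. It is applied to the table of defective items given above.

```python
def mean_value(table):
    """Mean value of a random variable given as {value: probability}."""
    total = sum(table.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("the probabilities in the table must sum to 1")
    # Sum of p_i * a_i over all rows of the probability table.
    return sum(value * prob for value, prob in table.items())


# Probability table for the number of defective items in a lot of 100.
defects = {0: 0.1, 1: 0.15, 2: 0.2, 3: 0.25, 4: 0.15, 5: 0.1, 6: 0.05}
mean_defects = mean_value(defects)
```

The result agrees with the value 2.7 found above (up to floating-point rounding) and lies between the least and the greatest value of the variable.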


From this formula it follows, in particular, that the mean value of a random
variable is indeed a mean, i.e., it never exceeds the maximum possible value of
the variable nor is less than its minimum value. In fact, suppose that $a_1$ is
the maximum value of the random variable $\alpha$ (i.e., $a_1 \ge a_2$,
$a_1 \ge a_3$, ..., $a_1 \ge a_k$) and $a_k$ is its least value (i.e.,
$a_k \le a_1$, $a_k \le a_2$, ..., $a_k \le a_{k-1}$); then


$$\text{M.V.}\,\alpha = p_1 a_1 + p_2 a_2 + \dots + p_k a_k \le p_1 a_1 + p_2 a_1 + \dots + p_k a_1 = (p_1 + p_2 + \dots + p_k)\,a_1 = a_1$$


and


$$\text{M.V.}\,\alpha = p_1 a_1 + p_2 a_2 + \dots + p_k a_k \ge p_1 a_k + p_2 a_k + \dots + p_k a_k = (p_1 + p_2 + \dots + p_k)\,a_k = a_k$$


(for $p_1 + p_2 + \dots + p_k = 1$).


†The concept of random variables is incidental to the main theme of this book but
occupies a central position in the theory of probability. In this connection, see, for example, the
second part of B. V. Gnedenko and A. Ya. Khinchin’s book [31].


Problem 4. Suppose that the probability tables showing the frequency for 
marksmen A and B hitting the target have the form: 


(i) For marksman A 


Number of points    0      1      2      3      4      5      6      7      8      9      10
Probability         0.02   0.03   0.05   0.1    0.15   0.2    0.2    0.1    0.07   0.05   0.03


(ii) For marksman B 


Number of points    0      1      2      3      4      5      6      7      8      9      10
Probability         0.01   0.01   0.04   0.1    0.25   0.3    0.18   0.05   0.03   0.02   0.01


Which of A and B should be regarded as the better marksman? 
Here, the mean number of points scored by A in one round is


$$0.02 \cdot 0 + 0.03 \cdot 1 + 0.05 \cdot 2 + 0.1 \cdot 3 + 0.15 \cdot 4 + 0.2 \cdot 5 + 0.2 \cdot 6 + 0.1 \cdot 7 + 0.07 \cdot 8 + 0.05 \cdot 9 + 0.03 \cdot 10 = 5.24,$$


and that for B is


$$0.01 \cdot 0 + 0.01 \cdot 1 + 0.04 \cdot 2 + 0.1 \cdot 3 + 0.25 \cdot 4 + 0.3 \cdot 5 + 0.18 \cdot 6 + 0.05 \cdot 7 + 0.03 \cdot 8 + 0.02 \cdot 9 + 0.01 \cdot 10 = 4.84 < 5.24.$$


This shows that A is the better marksman.


1.2. Properties of probability. Addition and multiplication of events. 
Incompatible and independent events 


From the definition of probability adduced in the preceding section, it follows
that the probability p(A) of every event A is a real number between 0
and 1:


$$0 \le p(A) \le 1.$$


Moreover, the probability may be 1, signifying that the event A is realized for 
every outcome of the experiment under consideration, i.e., that A is the certain
or sure event (thus, for example, the probability of drawing a white ball from 
an urn containing only white balls is 1). The probability may also be 0, imply- 
ing that the event is not realized for any outcome of the experiment, i.e., it is 
impossible (the probability of drawing a black ball from an urn containing only 
white balls is 0). 

Now, suppose that an experiment has only two mutually exclusive outcomes
A and B. In such a case, B is called the contrary event of A and is denoted by
$\bar{A}$ (read as ‘not-A’). If the event A is realized in m out of n equally probable
outcomes of an experiment, then the event $\bar{A}$ is realized in the remaining n - m
outcomes. Hence,


$$p(A) = m/n, \qquad p(\bar{A}) = (n - m)/n = 1 - (m/n).$$


Consequently,


$$p(\bar{A}) = 1 - p(A).$$


Thus, the table of probabilities for an experiment having exactly two outcomes 
takes the simple form 


$A$       $\bar{A}$
$p(A)$    $1 - p(A)$


Let us now consider two events $A$ and $A_1$ such that the occurrence of $A$
necessarily implies the occurrence of $A_1$ (for example, $A$ is the appearance of a
six in rolling a die and $A_1$ the appearance of a number divisible by 3). In such
a case, obviously $A_1$ must occur in all those outcomes of the experiment in which
the event $A$ is realized. Hence, the probability of $A_1$ cannot be less than that of
$A$. The situation in which the occurrence of $A$ implies that of $A_1$ we write in
symbols as $A \subset A_1$ (read as ‘A implies $A_1$’). We have, thus, the following
important property of probability:


if $A \subset A_1$, then $p(A) \le p(A_1)$.


We now consider the event which consists of the occurrence of at least one of the
two fixed events A and B. We call this event the sum of events A and B and
denote it by A + B. Here there are two basically distinct cases. If the
events A and B are incompatible, i.e., it is impossible for both of them to occur
simultaneously, then A occurs in some $m_1$ out of n equally probable outcomes of
an experiment and B in $m_2$ other outcomes. We have in this case


$$p(A) = \frac{m_1}{n}, \quad p(B) = \frac{m_2}{n} \quad \text{and} \quad p(A + B) = \frac{m_1 + m_2}{n} = \frac{m_1}{n} + \frac{m_2}{n},$$

i.e.,


$$p(A + B) = p(A) + p(B)$$


(the addition law of probabilities). Thus, in the urn example considered above, the
probability of drawing a white or a black ball, by virtue of this law, is


$$\frac{1}{2} + \frac{3}{10} = \frac{4}{5}.$$

The addition law of probabilities formulated above may be generalized as
follows. Suppose that we have k events $A_1, A_2, \dots, A_k$, any two of which are
incompatible. We denote by $A_1 + A_2 + \dots + A_k$ the event which
consists in the occurrence of at least one of these k events. Then, obviously,

$$p(A_1 + A_2 + \dots + A_k) = p(A_1) + p(A_2) + \dots + p(A_k).$$

This more general result is sometimes also called the addition law of probabilities.
In particular, if an experiment has k (and only k) distinct mutually exclusive
outcomes, then the probabilities corresponding to it are given by the table:


$A_1$      $A_2$      ...    $A_k$
$p(A_1)$   $p(A_2)$   ...    $p(A_k)$


where the numbers appearing in the lower row sum to 1, i.e.,

$$p(A_1) + p(A_2) + \dots + p(A_k) = 1.$$


This stems from the fact that
$p(A_1) + p(A_2) + \dots + p(A_k) = p(A_1 + A_2 + \dots + A_k)$
and that $A_1 + A_2 + \dots + A_k$ is a sure event (because some one
outcome of the experiment is certain to be realized), so that


$$p(A_1 + A_2 + \dots + A_k) = 1.$$


Let us now assume that the events A and B may be compatible, i.e., can be
realized simultaneously. In this case, it is, however, impossible to assert that
p(A + B) = p(A) + p(B). Indeed, suppose that A occurs in $m_1$ and B in $m_2$
of n equally probable outcomes of the experiment. The event A + B is realized
if the outcome that takes place is either one of the first $m_1$ or one of
the second $m_2$; however, since these two sets of $m_1$ and $m_2$ outcomes may have
common outcomes, the total number of outcomes in the two sets
may be less than $m_1 + m_2$. Thus, in the general case, all that we can assert is
that the probability of the sum of two events can never exceed the sum of their
probabilities:


$$p(A + B) \le p(A) + p(B)$$


(but $p(A + B) \ge p(A)$ and $p(A + B) \ge p(B)$, since $A \subset A + B$ and $B \subset A + B$
by the very definition of the sum of events). Similarly, for any k arbitrary
events (not necessarily mutually exclusive), we have


$$p(A_1 + A_2 + \dots + A_k) \le p(A_1) + p(A_2) + \dots + p(A_k).$$


The inequality $p(A + B) \le p(A) + p(B)$ can be made slightly more precise.
We define the product of two events A and B as the event wherein both the events
are realized simultaneously and denote it by AB. Let us consider the $m_1$
(correspondingly $m_2$) equally probable outcomes of an experiment in which the event
A (correspondingly B) occurs; we assume that there are precisely $l$ outcomes
contained both in the $m_1$ outcomes favourable to the occurrence of A and the $m_2$
outcomes favourable to the occurrence of B. It is obvious that both the events
A and B are simultaneously realized if and only if one of these $l$ outcomes
occurs. Hence $p(AB) = l/n$. On the other hand, if exactly $l$ outcomes are
contained both in the $m_1$ outcomes favourable to the occurrence of A and the
$m_2$ outcomes favourable to the occurrence of B, then in all we have $m_1 + m_2 - l$
outcomes favourable to A + B (since the sum $m_1 + m_2$ contains $l$ outcomes which are counted
twice). Therefore,


$$p(A + B) = \frac{m_1 + m_2 - l}{n} = \frac{m_1}{n} + \frac{m_2}{n} - \frac{l}{n}$$


and, consequently,


$$p(A + B) = p(A) + p(B) - p(AB).$$


It is seen that the problem of determining the probability of the sum A + B
of events A and B reduces to the evaluation of the probability of the product AB
of these events. The latter problem is not quite simple in the general case and it
will be considered in the next section. However, there is a particular case in
which the evaluation of the probability of the event AB does not present any
difficulty. This is the case in which A and B are independent events, i.e., the case in
which the result of the experiment with which the occurrence or nonoccurrence
of the event A is associated in no way influences the conditions of the
experiment with which the event B is connected. Thus, for instance, the events involved
in drawing a black ball from different urns containing black and white balls
are independent, but two successive draws of a black ball from one urn (without
replacement of the ball drawn) are not independent events (since the result
of the first draw alters the number of balls left in the urn and, hence, is
reflected in the conditions of the second experiment).
Suppose that the event A occurs in $m_1$ out of $n_1$ equally probable outcomes
of the first experiment and, independently of this, the event B occurs in $m_2$ out
of $n_2$ equally probable outcomes of the second experiment. Then, the
probability of A is $m_1/n_1$ and that of B is $m_2/n_2$. We now consider a compound
experiment consisting of both the experiments under discussion. It is obvious that
this compound experiment can have $n_1 n_2$ distinct equally probable outcomes,
since with each of the $n_1$ distinct outcomes of the first experiment we can associate
the $n_2$ distinct outcomes of the second experiment. Of these $n_1 n_2$ equally probable
outcomes, $m_1 m_2$ outcomes are favourable to the occurrence of
AB. These are obtained by combining the $m_1$ outcomes of the first experiment
favourable to A with the $m_2$ outcomes of the second experiment favourable to
B. The probability of the event AB is thus given by


$$p(AB) = \frac{m_1 m_2}{n_1 n_2} = \frac{m_1}{n_1} \cdot \frac{m_2}{n_2}$$


and, hence,

$$p(AB) = p(A)\,p(B)$$


(the multiplication law of probabilities).
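
The counting argument behind the multiplication law can be reproduced by listing the outcomes of a compound experiment explicitly. The Python sketch below is an illustration only; the particular pair of experiments (a die roll and a coin flip), the events chosen, and the helper name `prob` are assumptions made here for the example.

```python
from fractions import Fraction
from itertools import product


def prob(outcomes, event):
    """Ratio of outcomes satisfying `event` to all equally probable outcomes."""
    outcomes = list(outcomes)
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


first = range(1, 7)         # first experiment: roll of a die, n1 = 6
second = ["head", "tail"]   # second experiment: flip of a coin, n2 = 2


def in_a(face):
    """Event A: the die shows a number divisible by 3 (m1 = 2)."""
    return face % 3 == 0


def in_b(side):
    """Event B: the coin shows a head (m2 = 1)."""
    return side == "head"


# The compound experiment has n1 * n2 equally probable outcomes.
compound = list(product(first, second))
p_a = prob(first, in_a)
p_b = prob(second, in_b)
p_ab = prob(compound, lambda pair: in_a(pair[0]) and in_b(pair[1]))
```

Here the compound experiment has 12 outcomes, of which 2 are favourable to AB, so that p(AB) = 2/12 = (2/6)(1/2).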

This law can now be generalized as follows. Suppose that $A_1, A_2, \dots, A_k$
are any k mutually independent events, i.e., the conditions of the experiment to which
any one of these events is related depend in no way upon the
occurrence or nonoccurrence of the remaining events. In such a case,


$$p(A_1 A_2 \dots A_k) = p(A_1)\,p(A_2) \dots p(A_k).$$


The proof of this relation is exactly the same as the derivation of the formula
p(AB) = p(A) p(B), which forms its particular case.

If the events A and B are not independent, then the multiplication law
p(AB) = p(A) p(B) is not guaranteed. For example, if $B \subset A$ (say, A is the appearance
of an even number in a die roll and B that of a two), then the event AB
coincides with B and, consequently, p(AB) = p(B). In general, we can only assert that
$p(AB) \le p(A)$ and $p(AB) \le p(B)$ (since from the definition of the product of
events it follows that $AB \subset B$ and $AB \subset A$). The question concerning the
evaluation of the probability of the product of two arbitrary events will be dealt with in more
detail in the next section.

A few problems now follow to illustrate the applications of the simple
properties of probability we have deduced.


Problem 5. A coin is flipped 2 times. What is the probability that a head 
occurs on both the flips? 

We seek here the probability of the event AB, where A is the occurrence of a
head on the first flip and B is the occurrence of the same face, that is, a head, on
the second flip. The events A and B are obviously independent. Hence,


$$p(AB) = p(A)\,p(B) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$


(see Problem 1).


Problem 6. We select at random a positive integer not exceeding 1000. What 
is the probability that the selected positive integer can be expressed as a power of 
another integer (with exponent greater than unity)? 

The term ‘at random’ in the formulation of this problem implies that we
regard the appearance of any number between 1 and 1000 to be equally probable.
Furthermore, since


$$2^9 < 1000 < 2^{10},\quad 3^6 < 1000 < 3^7,\quad 5^4 < 1000 < 5^5,\quad 6^3 < 1000 < 6^4,$$
$$7^3 < 1000 < 7^4,\quad 10^3 = 1000 < 10^4,\quad 11^2 < 1000 < 11^3,$$
$$12^2 < 1000 < 12^3,\ \dots,\ 31^2 < 1000 < 31^3,\quad 32^2 > 1000,$$


the probability that the selected integer will be a power of 2 is 8/1000 (among
the 1000 integers between 1 and 1000, there are 8 which occur as powers of two:
$2^2 = 4$, $2^3 = 8$, $2^4$, $2^5$, $2^6$, $2^7$, $2^8$, $2^9$); in exactly the same way, the probabilities
that the selected integer will equal the number 3, 5, 6, 7, 10, 11, 12, 13, 14, 15,
17, 18, 19, 20, 21, 22, 23, 24, 26, 28, 29, 30 and 31 raised to an integer power
greater than 1 are correspondingly given by


$$\frac{5}{1000},\ \frac{3}{1000},\ \frac{2}{1000},\ \frac{2}{1000},\ \frac{2}{1000},\ \frac{1}{1000},\ \frac{1}{1000},\ \dots,\ \frac{1}{1000}$$


(powers of the numbers 4, 8, 9, 16, 25 and 27 are at the same time higher powers
of a smaller number; hence, these cases
have been excluded). Since all the corresponding events are pairwise
incompatible, the desired probability is given by


$$\frac{8}{1000} + \frac{5}{1000} + \frac{3}{1000} + \frac{2}{1000} + \frac{2}{1000} + \frac{2}{1000} + \underbrace{\frac{1}{1000} + \frac{1}{1000} + \dots + \frac{1}{1000}}_{18\ \text{times}} = \frac{40}{1000} = \frac{1}{25}.$$
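
The count of 40 favourable integers in Problem 6 can be confirmed by brute force. The Python sketch below is an illustration only; the function name `perfect_powers` is a choice made here, and the number 1 is deliberately left out, on the assumption that it is not counted as a power of "another" integer, in agreement with the count in the text.

```python
from fractions import Fraction


def perfect_powers(limit):
    """Set of integers a**b <= limit with base a >= 2 and exponent b >= 2."""
    found = set()
    base = 2
    while base * base <= limit:
        power = base * base
        while power <= limit:
            found.add(power)   # a set removes duplicates such as 64 = 2**6 = 8**2
            power *= base
        base += 1
    return found


powers = perfect_powers(1000)
probability = Fraction(len(powers), 1000)
```

Because the results are collected in a set, a number such as 64 is counted once, which plays the same role as excluding the bases 4, 8, 9, 16, 25 and 27 in the solution above.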


Problem 7. In a 52-card deck, cards of one of the four suits are specified as 
trumps. What is the probability that a card selected at random is either an ace 
or a trump? 

Suppose that the event A (correspondingly B) is that the card drawn is an ace
(correspondingly, a trump); then the event AB is that this card is the ace of trumps,
and p(A) = 4/52 = 1/13 (each suit of the deck contains 13 cards: two, three, ..., ace),
p(B) = 13/52 = 1/4, p(AB) = 1/52. Hence, the desired probability is given by


$$p(A + B) = p(A) + p(B) - p(AB) = \frac{1}{13} + \frac{1}{4} - \frac{1}{52} = \frac{4}{13}.$$
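
The formula for the sum of two compatible events can be checked on Problem 7 by listing all 52 cards. The Python sketch below is an illustration only; the rank and suit labels and the choice of hearts as the trump suit are arbitrary assumptions made here.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king", "ace"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))   # 52 equally probable outcomes

trump = "hearts"                     # arbitrary choice of trump suit
aces = {card for card in deck if card[0] == "ace"}      # event A
trumps = {card for card in deck if card[1] == trump}    # event B


def p(event):
    """Probability of an event given as a set of cards."""
    return Fraction(len(event), len(deck))


p_sum = p(aces | trumps)                              # p(A + B), counted directly
p_formula = p(aces) + p(trumps) - p(aces & trumps)    # p(A) + p(B) - p(AB)
```

The set union counts the ace of trumps once, which is exactly the correction expressed by the term p(AB) in the formula.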


Problem 8. Six hunters saw a fox and simultaneously shot at it. It is assumed 
that each of them, as a rule, hits the fox at a given distance and kills it in one out 
of three chances. What is the probability that the fox is killed? 

Suppose that $A_1, A_2, \dots, A_6$ denote the events that the fox is killed by the
1st, 2nd, ..., 6th hunter. In the hypothesis of the problem, it is indicated that


$$p(A_1) = p(A_2) = \dots = p(A_6) = \frac{1}{3};$$


it is required to find p(S), where $S = A_1 + A_2 + \dots + A_6$. The events
$A_1, A_2, \dots, A_6$ are obviously independent; this enables us to solve this problem by
multiple use of the formula


$$p(A + B) = p(A) + p(B) - p(AB) = p(A) + p(B) - p(A)\,p(B)$$


(see the discussion below in small type). However, such a solution is not simple
since the formula expressing the probability of the sum of several (compatible)
events is fairly complicated.

The following version of the solution of this problem is more convenient. Let
us first determine the probability $p(\bar{S})$ that the fox escapes. A miss by the 1st,
2nd, ..., 6th hunter is naturally denoted by $\bar{A}_1, \bar{A}_2, \dots, \bar{A}_6$; by the formula
$p(\bar{A}) = 1 - p(A)$, we have

$$p(\bar{A}_1) = p(\bar{A}_2) = \dots = p(\bar{A}_6) = \frac{2}{3}.$$


In order for the fox to survive, it is necessary that all the hunters miss the
target, i.e., the problem here relates to the probability of the product of the events
$\bar{A}_1, \bar{A}_2, \dots, \bar{A}_6$, where all the events $\bar{A}_1, \bar{A}_2, \dots, \bar{A}_6$ are mutually independent.
Thus,

$$p(\bar{S}) = p(\bar{A}_1 \bar{A}_2 \dots \bar{A}_6) = p(\bar{A}_1) \dots p(\bar{A}_6) = \frac{2}{3} \cdot \frac{2}{3} \dots \frac{2}{3} = \frac{2^6}{3^6} = \frac{64}{729},$$


and, recalling the formula $p(\bar{A}) = 1 - p(A)$,


$$p(S) = 1 - \frac{64}{729} = \frac{665}{729}.$$

The formula


$$p(A + B) = p(A) + p(B) - p(AB)$$

can be extended also to the case of determining the probability of the sum of an arbitrary
number k of (possibly compatible) events $A_1, A_2, \dots, A_k$. We have

$$p(A_1 + A_2 + A_3) = p\{(A_1 + A_2) + A_3\} = p(A_1 + A_2) + p(A_3) - p\{(A_1 + A_2)A_3\}.$$

Here

$$p(A_1 + A_2) = p(A_1) + p(A_2) - p(A_1 A_2).$$


Let us now explain the meaning of the more complicated expression $p\{(A_1 + A_2)A_3\}$. By the
definition of the sum and product of events, the event $(A_1 + A_2)A_3$ consists of the occurrence
of at least one of the events $A_1$ and $A_2$ and, simultaneously, of the event $A_3$. But this
means that at least one of the events $A_1 A_3$ and $A_2 A_3$ occurs. Thus, we have


$$(A_1 + A_2)A_3 = A_1 A_3 + A_2 A_3,$$

and, consequently,

$$p\{(A_1 + A_2)A_3\} = p(A_1 A_3 + A_2 A_3) = p(A_1 A_3) + p(A_2 A_3) - p\{(A_1 A_3)(A_2 A_3)\}.$$


Furthermore, the event $(A_1 A_3)(A_2 A_3)$ consists of the simultaneous occurrence of both the
events $A_1 A_3$ (i.e., $A_1$ as well as $A_3$) and $A_2 A_3$ ($A_2$ as well as $A_3$). In other words, the event
$(A_1 A_3)(A_2 A_3)$ consists of the simultaneous occurrence of the three events $A_1$, $A_2$ and $A_3$, i.e., it
does not differ from the event $A_1 A_2 A_3$.

We thus finally obtain


$$p(A_1 + A_2 + A_3) = p(A_1) + p(A_2) - p(A_1 A_2) + p(A_3) - p(A_1 A_3) - p(A_2 A_3) + p(A_1 A_2 A_3),$$

or, in a different order,

$$p(A_1 + A_2 + A_3) = p(A_1) + p(A_2) + p(A_3) - p(A_1 A_2) - p(A_1 A_3) - p(A_2 A_3) + p(A_1 A_2 A_3).$$

Proceeding on lines similar to this, for arbitrary k, we have

$$p(A_1 + A_2 + \dots + A_k) = p(A_1) + p(A_2) + \dots + p(A_k) - p(A_1 A_2) - p(A_1 A_3) - \dots - p(A_{k-1} A_k)$$
$$\qquad + p(A_1 A_2 A_3) + p(A_1 A_2 A_4) + \dots + p(A_{k-2} A_{k-1} A_k) - p(A_1 A_2 A_3 A_4) - \dots + (-1)^{k-1} p(A_1 A_2 \dots A_k).$$


This formula can be easily proved by induction, following a procedure similar to the one
demonstrated above for the case k = 3.


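Let us now solve Problem 8 with the aid of the formula deduced. For k = 6,


$$p(A_1 + A_2 + \dots + A_6) = p(A_1) + p(A_2) + \dots + p(A_6) - p(A_1 A_2) - p(A_1 A_3) - \dots - p(A_5 A_6)$$
$$\qquad + p(A_1 A_2 A_3) + p(A_1 A_2 A_4) + \dots + p(A_4 A_5 A_6) - \dots - p(A_1 A_2 A_3 A_4 A_5 A_6).$$


But (since all the events $A_1, A_2, \dots, A_6$ are mutually independent),


$$p(A_1) = p(A_2) = \dots = p(A_6) = \frac{1}{3},$$
$$p(A_1 A_2) = p(A_1 A_3) = \dots = p(A_5 A_6) = p(A_1)\,p(A_2) = \left(\frac{1}{3}\right)^2,$$
$$p(A_1 A_2 A_3) = \dots = p(A_4 A_5 A_6) = p(A_1)\,p(A_2)\,p(A_3) = \left(\frac{1}{3}\right)^3, \quad \dots,$$
$$p(A_1 A_2 \dots A_6) = p(A_1)\,p(A_2) \dots p(A_6) = \left(\frac{1}{3}\right)^6;$$


hence we get


$$p(A_1 + A_2 + \dots + A_6) = 6 \cdot \frac{1}{3} - \binom{6}{2}\left(\frac{1}{3}\right)^2 + \binom{6}{3}\left(\frac{1}{3}\right)^3 - \binom{6}{4}\left(\frac{1}{3}\right)^4 + \binom{6}{5}\left(\frac{1}{3}\right)^5 - \binom{6}{6}\left(\frac{1}{3}\right)^6$$
$$\qquad = 1 - \left(1 - \frac{1}{3}\right)^6 = \frac{665}{729},$$

i.e., the same result as above.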
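
The agreement of the two solutions of Problem 8 can also be checked numerically. The Python sketch below is an illustration only; the function name `p_at_least_one` is a choice made here, and the code assumes, as in the problem, k mutually independent events that all have the same probability p, so that the j-fold products in the general formula all equal p**j and there are C(k, j) of them.

```python
from fractions import Fraction
from math import comb


def p_at_least_one(k, p):
    """Probability of the sum of k independent events, each of probability p,
    computed from the alternating formula for the sum of k events."""
    return sum((-1) ** (j - 1) * comb(k, j) * p ** j for j in range(1, k + 1))


p_hit = Fraction(1, 3)                  # each hunter kills the fox with probability 1/3
by_formula = p_at_least_one(6, p_hit)   # solution by the general formula
by_complement = 1 - (1 - p_hit) ** 6    # solution via the contrary event
```

Both computations give 665/729, and the second one is clearly the shorter route, as remarked in the text.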


Other examples of the applications of this general formula can be found in [40]. 


We now turn to the concepts of the ‘sum’ and ‘product’ of random variables, 
which will be put to good use in the sequel as well. We illustrate the former by 
the following example. 


Problem 9. Two different lathes are installed in a workshop, manufacturing 
identical parts. It is known from experience that the first (older) lathe may 
produce up to three defective parts in a day, the probability of the number of 
defective parts being given here by: 


Number of defective parts (per day)    0      1      2      3
Probability                            0.3    0.4    0.2    0.1


The second (new) lathe produces not more than one defective part per day. The
probability that one of the parts manufactured by it in a day is found defective
is only 0.1:


Number of defective parts (per day)    0      1
Probability                            0.9    0.1


The question is: What is the average number of defective parts manufactured per
day in the workshop?

In this problem, we simultaneously consider two random variables $\alpha$ and $\beta$.
The former variable $\alpha$ takes the values $a_0, a_1, a_2$ and $a_3$ (precisely, 0, 1, 2 and
3) with the probabilities $p_0, p_1, p_2$ and $p_3$ (which are equal in our case to 0.3,
0.4, 0.2 and 0.1; obviously $p_0 + p_1 + p_2 + p_3 = 1$, as it must be). The latter
variable $\beta$ takes only two values $b_0$ and $b_1$ (namely, 0 and 1) with the
probabilities $q_0$ and $q_1$ (in this case 0.9 and 0.1; clearly, $q_0 + q_1 = 1$, as it must be).
The mean values of these random variables (i.e., $\alpha$ and $\beta$) represent the average
number of defective parts produced in a day, respectively, by the first and second
lathes; they are correspondingly given by

$$\text{m.v. }\alpha = p_0a_0 + p_1a_1 + p_2a_2 + p_3a_3 = 0.3 \cdot 0 + 0.4 \cdot 1 + 0.2 \cdot 2 + 0.1 \cdot 3 = 1.1,$$

and

$$\text{m.v. }\beta = q_0b_0 + q_1b_1 = 0.9 \cdot 0 + 0.1 \cdot 1 = 0.1.$$


We are, however, interested in the random variable $\alpha + \beta$, representing the
number of defective parts produced in a day by both the lathes. This variable can
take the values

$$a_0 + b_0,\ a_0 + b_1;\quad a_1 + b_0,\ a_1 + b_1;\quad a_2 + b_0,\ a_2 + b_1;\quad a_3 + b_0 \text{ and } a_3 + b_1$$

(in the present case, the values 0, 1, 2, 3 and 4). We shall assume (for a while!)
that the random variables $\alpha$ and $\beta$ are independent, i.e., we assume that the
random variable $\alpha$ takes the values 0, 1, 2 and 3 with the probabilities $p_0, p_1, p_2$
and $p_3$ (i.e., 0.3, 0.4, 0.2 and 0.1) irrespective of the value which is taken by the
variable $\beta$ (for the same day). Then, the events $\alpha = a_i$ ($i = 0, 1, 2$ or 3) and
$\beta = b_j$ ($j = 0$ or 1) are also independent, and hence,

$$p(\alpha = a_i \text{ and } \beta = b_j) = p(\alpha = a_i) \cdot p(\beta = b_j) = p_iq_j.$$


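This yields the following (detailed) probability table of the random variable
$\alpha + \beta$:

$$\begin{array}{l|cccccccc}
\text{Values} & a_0+b_0 & a_0+b_1 & a_1+b_0 & a_1+b_1 & a_2+b_0 & a_2+b_1 & a_3+b_0 & a_3+b_1 \\
 & (=0) & (=1) & (=1) & (=2) & (=2) & (=3) & (=3) & (=4) \\ \hline
\text{Probabilities} & p_0q_0 & p_0q_1 & p_1q_0 & p_1q_1 & p_2q_0 & p_2q_1 & p_3q_0 & p_3q_1 \\
 & (=0.27) & (=0.03) & (=0.36) & (=0.04) & (=0.18) & (=0.02) & (=0.09) & (=0.01)
\end{array}$$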
Hence

$$\begin{aligned}
\text{m.v. }(\alpha + \beta) ={}& p_0q_0(a_0 + b_0) + p_0q_1(a_0 + b_1) + p_1q_0(a_1 + b_0) \\
&+ p_1q_1(a_1 + b_1) + p_2q_0(a_2 + b_0) + p_2q_1(a_2 + b_1) \\
&+ p_3q_0(a_3 + b_0) + p_3q_1(a_3 + b_1) \\
={}& a_0(p_0q_0 + p_0q_1) + a_1(p_1q_0 + p_1q_1) + a_2(p_2q_0 + p_2q_1) \\
&+ a_3(p_3q_0 + p_3q_1) + b_0(p_0q_0 + p_1q_0 + p_2q_0 + p_3q_0) \\
&+ b_1(p_0q_1 + p_1q_1 + p_2q_1 + p_3q_1) \\
={}& a_0p_0(q_0 + q_1) + a_1p_1(q_0 + q_1) + a_2p_2(q_0 + q_1) \\
&+ a_3p_3(q_0 + q_1) + b_0q_0(p_0 + p_1 + p_2 + p_3) \\
&+ b_1q_1(p_0 + p_1 + p_2 + p_3) \\
={}& (a_0p_0 + a_1p_1 + a_2p_2 + a_3p_3) + (b_0q_0 + b_1q_1) \\
={}& \text{m.v. }\alpha + \text{m.v. }\beta = 1.2 \text{ (defective parts per day)}.
\end{aligned}$$

It is thus seen that the mean value of the sum of two random variables is the
sum of their mean values.
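The calculation of Problem 9 can be retraced in a few lines of code. The sketch below assumes, as the text does at this point, that the two lathes are independent; the function names `mean` and `sum_distribution` are illustrative and do not come from the original.

```python
from collections import defaultdict
from itertools import product

# Probability tables of Problem 9: value -> probability.
alpha = {0: 0.3, 1: 0.4, 2: 0.2, 3: 0.1}  # defective parts, first lathe
beta = {0: 0.9, 1: 0.1}                   # defective parts, second lathe

def mean(dist):
    """Mean value: sum of value times probability."""
    return sum(value * prob for value, prob in dist.items())

def sum_distribution(d1, d2):
    """Probability table of the sum of two INDEPENDENT random variables."""
    out = defaultdict(float)
    for (a, p), (b, q) in product(d1.items(), d2.items()):
        out[a + b] += p * q  # independence: p(a and b) = p(a) * p(b)
    return dict(out)

total = sum_distribution(alpha, beta)
print(total)                                  # collapsed table for the values 0..4
print(mean(alpha) + mean(beta), mean(total))  # the two numbers agree up to rounding
```

Here the eight columns of the detailed table are merged wherever two pairs give the same sum, which does not change the mean value.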

However, it is worthwhile to note that the last conclusion, obtained via rather
tedious algebraic transformations, is in fact quite elementary. Suppose that
on a specific day, say the first day, the first lathe produces $a^{(1)}$ defective parts
(where $a^{(1)}$ equals 0, 1, 2 or 3) and the second lathe produces $b^{(1)}$ defective parts
(where $b^{(1)}$ equals 0 or 1). Let it be further assumed that on the second,
third, ..., $n$th day, the first lathe produces $a^{(2)}, a^{(3)}, \ldots, a^{(n)}$ defective parts
and the second lathe produces $b^{(2)}, b^{(3)}, \ldots, b^{(n)}$ such parts. Then, the total
number of defective parts produced on the first, second, third, ..., $n$th day is
given by

$$a^{(1)} + b^{(1)},\ a^{(2)} + b^{(2)},\ a^{(3)} + b^{(3)},\ \ldots,\ a^{(n)} + b^{(n)},$$

and the mean number of defective parts produced per day is given by

$$\frac{(a^{(1)} + b^{(1)}) + (a^{(2)} + b^{(2)}) + \cdots + (a^{(n)} + b^{(n)})}{n}
= \frac{a^{(1)} + a^{(2)} + \cdots + a^{(n)}}{n} + \frac{b^{(1)} + b^{(2)} + \cdots + b^{(n)}}{n}.$$

But, for large $n$, the value

$$\frac{(a^{(1)} + b^{(1)}) + (a^{(2)} + b^{(2)}) + \cdots + (a^{(n)} + b^{(n)})}{n}$$

will be very close to m.v. $(\alpha + \beta)$, and the values

$$\frac{a^{(1)} + a^{(2)} + \cdots + a^{(n)}}{n}
\quad\text{and}\quad
\frac{b^{(1)} + b^{(2)} + \cdots + b^{(n)}}{n}$$

to m.v. $\alpha$ and m.v. $\beta$. This fact obviously implies that

$$\text{m.v. }(\alpha + \beta) = \text{m.v. }\alpha + \text{m.v. }\beta.$$


It is noteworthy that the conclusion established by the preceding simple
reasoning is more general than that proved algebraically earlier. In fact, in this
reasoning we did not have to rely on the independence of the variables $\alpha$ and $\beta$
(which, as a matter of fact, quite often fails to hold in practice, because the
operations of both lathes can be affected by certain common factors such as,
e.g., use of the same raw material by both lathes). Therefore, it is impossible
to assert in the general case that

$$p(\alpha = a_i \text{ and } \beta = b_j) = p(\alpha = a_i) \cdot p(\beta = b_j) = p_iq_j.$$

Hence, in place of the values $p_0q_0$, $p_0q_1$ and so on, the second row of the
probability table of the random variable $\alpha + \beta$ will contain certain probabilities $p_{00}$
(the probability that $\alpha = a_0$ and $\beta = b_0$), $p_{01}$ (the probability that $\alpha = a_0$ and
$\beta = b_1$) and so on, whose numerical values depend on the relationship between
the variables $\alpha$ and $\beta$, usually unknown to us in all details.

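This situation, however, has almost no impact on the calculations adduced
earlier. In fact, we now have

$$\begin{aligned}
\text{m.v. }(\alpha + \beta) ={}& p_{00}(a_0 + b_0) + p_{01}(a_0 + b_1) + p_{10}(a_1 + b_0) + p_{11}(a_1 + b_1) \\
&+ p_{20}(a_2 + b_0) + p_{21}(a_2 + b_1) + p_{30}(a_3 + b_0) + p_{31}(a_3 + b_1) \\
={}& a_0(p_{00} + p_{01}) + a_1(p_{10} + p_{11}) + a_2(p_{20} + p_{21}) \\
&+ a_3(p_{30} + p_{31}) + b_0(p_{00} + p_{10} + p_{20} + p_{30}) \\
&+ b_1(p_{01} + p_{11} + p_{21} + p_{31}).
\end{aligned}$$

But

$$\begin{aligned}
p_{00} + p_{01} &= p(\alpha = a_0 \text{ and } \beta = b_0) + p(\alpha = a_0 \text{ and } \beta = b_1) \\
&= p(\alpha = a_0 \text{ and } \beta = b_0 \text{ or } b_1).
\end{aligned}$$

However, $b_0$ and $b_1$ represent all possible values of the random variable $\beta$ and
hence $p(\alpha = a_0 \text{ and } \beta = b_0 \text{ or } b_1)$ is nothing more than simply $p(\alpha = a_0) = p_0$.
In precisely the same way, it is established that

$$p_{10} + p_{11} = p_1, \quad p_{20} + p_{21} = p_2, \quad p_{30} + p_{31} = p_3.$$

Furthermore,

$$\begin{aligned}
p_{00} + p_{10} + p_{20} + p_{30} ={}& p(\alpha = a_0 \text{ and } \beta = b_0) + p(\alpha = a_1 \text{ and } \beta = b_0) \\
&+ p(\alpha = a_2 \text{ and } \beta = b_0) + p(\alpha = a_3 \text{ and } \beta = b_0) \\
={}& p(\alpha = a_0 \text{ or } a_1 \text{ or } a_2 \text{ or } a_3 \text{ and } \beta = b_0) \\
={}& p(\beta = b_0) = q_0,
\end{aligned}$$

and similarly

$$p_{01} + p_{11} + p_{21} + p_{31} = q_1.$$

Thus, as before, in this case we have

$$\begin{aligned}
\text{m.v. }(\alpha + \beta) &= (a_0p_0 + a_1p_1 + a_2p_2 + a_3p_3) + (b_0q_0 + b_1q_1) \\
&= \text{m.v. }\alpha + \text{m.v. }\beta.
\end{aligned}$$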


The results obtained can, of course, be extended to any number of random
variables: the mean value of their sum is likewise equal to the sum of their
mean values.
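That independence is not needed for the mean of a sum can be seen on a deliberately dependent example. In the sketch below the joint table is a hypothetical one, invented for illustration: it keeps the row and column sums of Problem 9 (0.3, 0.4, 0.2, 0.1 and 0.9, 0.1), but lets the second lathe produce a defective part only on days when the first lathe produces three.

```python
from fractions import Fraction as F

# Hypothetical joint table p_ij = p(alpha = i and beta = j); not from the original text.
joint = {
    (0, 0): F(3, 10), (0, 1): F(0),
    (1, 0): F(4, 10), (1, 1): F(0),
    (2, 0): F(2, 10), (2, 1): F(0),
    (3, 0): F(0),     (3, 1): F(1, 10),
}

def marginal(table, axis):
    """Sum the joint table over the other variable (axis 0: alpha, axis 1: beta)."""
    out = {}
    for key, prob in table.items():
        out[key[axis]] = out.get(key[axis], F(0)) + prob
    return out

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

p = marginal(joint, 0)  # probabilities p_0, ..., p_3
q = marginal(joint, 1)  # probabilities q_0, q_1

mean_sum = sum((a + b) * prob for (a, b), prob in joint.items())
independent = all(joint[a, b] == p[a] * q[b] for (a, b) in joint)

print(mean(p), mean(q), mean_sum)  # 11/10 1/10 6/5
print(independent)                 # False
```

The variables are clearly not independent, yet the mean value of the sum is still 6/5 = 1.2.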

We now return to the notion of the product of two random variables and put 
this to work in the following example. 


Problem 10. Every year a farmer sends $a_0$, $a_1$, $a_2$ or $a_3$ calves to a market, and
the probability (frequency) of a specific number of calves being sold is given by

$$\begin{array}{l|cccc}
\text{Number of calves} & a_0 & a_1 & a_2 & a_3 \\ \hline
\text{Probability} & p_0 & p_1 & p_2 & p_3
\end{array}$$

(where, of course, $p_0 + p_1 + p_2 + p_3 = 1$). On the other hand, the price
fetched by a calf in different years may equal either $b_0$ or $b_1$, the probabilities of
these prices being, respectively, equal to $q_0$ and $q_1$ ($= 1 - q_0$):

$$\begin{array}{l|cc}
\text{Price of calf} & b_0 & b_1 \\ \hline
\text{Probability} & q_0 & q_1
\end{array}$$

Find the farmer's mean annual receipt from the sale of calves.

Here, we are again concerned with two random variables $\alpha$ and $\beta$. Retaining
the analogy with Problem 9, we adhere to the same symbols as above and
denote the possible values of these variables and the probabilities of these values
by $a_0, a_1, a_2, a_3$; $b_0, b_1$ and $p_0, p_1, p_2, p_3$; $q_0, q_1$. Now, we are interested in the
product $\alpha\beta$ of these two variables (the product of the number of calves sold and
the price fetched by a calf), which can take the 8 values $a_0b_0, a_0b_1$; $a_1b_0, a_1b_1$; $a_2b_0,
a_2b_1$; $a_3b_0, a_3b_1$. In addition, if we consider $\alpha$ and $\beta$ to be independent, then the
probability table of the variable $\alpha\beta$ has the form

$$\begin{array}{l|cccccccc}
\text{Values} & a_0b_0 & a_0b_1 & a_1b_0 & a_1b_1 & a_2b_0 & a_2b_1 & a_3b_0 & a_3b_1 \\ \hline
\text{Probability} & p_0q_0 & p_0q_1 & p_1q_0 & p_1q_1 & p_2q_0 & p_2q_1 & p_3q_0 & p_3q_1
\end{array}$$


Hence, the mean value of $\alpha\beta$ in this case is given by

$$\begin{aligned}
\text{m.v. }(\alpha\beta) ={}& p_0q_0a_0b_0 + p_0q_1a_0b_1 + p_1q_0a_1b_0 + p_1q_1a_1b_1 + p_2q_0a_2b_0 \\
&+ p_2q_1a_2b_1 + p_3q_0a_3b_0 + p_3q_1a_3b_1 \\
={}& p_0a_0(q_0b_0 + q_1b_1) + p_1a_1(q_0b_0 + q_1b_1) \\
&+ p_2a_2(q_0b_0 + q_1b_1) + p_3a_3(q_0b_0 + q_1b_1) \\
={}& (p_0a_0 + p_1a_1 + p_2a_2 + p_3a_3)(q_0b_0 + q_1b_1) \\
={}& (\text{m.v. }\alpha) \cdot (\text{m.v. }\beta).
\end{aligned}$$

It is thus seen that, for independent random variables $\alpha$ and $\beta$, the mean value
of their product always equals the product of the mean values of these variables.
The same principle also holds for any number of mutually independent random
variables; here also the mean value of the product of all variables equals the
product of the mean values of all the factor variables.

It may be remarked that, in contrast to the case of the sum of random variables,
in the case of the product the independence of the factor variables is an essential
condition, without which the results stated above may turn out to be false.
To illustrate this, it suffices to consider the case in which $\alpha_1 = \alpha_2 = \alpha$, where $\alpha$
is characterized by the following probability table:

Values of the variable α    +1     −1
Probability                 0.5    0.5

In this case it is obvious that

$$\text{m.v. }\alpha_1 = \text{m.v. }\alpha_2 = 0.5(+1) + 0.5(-1) = 0,$$

so that

$$(\text{m.v. }\alpha_1) \times (\text{m.v. }\alpha_2) = 0 \times 0 = 0.$$

It is also evident that the variable $\alpha_1 \times \alpha_2 = \alpha^2$ is always equal to $+1$ (since
$(+1)^2 = (-1)^2 = +1$), so that

$$\text{m.v. }(\alpha_1\alpha_2) = 1 > 0 = (\text{m.v. }\alpha_1) \times (\text{m.v. }\alpha_2).$$

The inequality

$$\text{m.v. }(\alpha^2) > (\text{m.v. }\alpha)^2$$

established by this example will be revisited in Sec. 4 of this chapter.
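Both halves of this discussion, the product rule for independent variables and its failure for dependent ones, can be checked directly. In the sketch below the numbers of calves and the prices are hypothetical values chosen for illustration (Problem 10 is stated only in symbols); the table for the ±1 variable is the one given in the text.

```python
from fractions import Fraction as F
from itertools import product

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

def mean_of_product_independent(d1, d2):
    """m.v. of the product when the two variables are independent."""
    return sum(a * b * p * q for (a, p), (b, q) in product(d1.items(), d2.items()))

# Hypothetical data in the spirit of Problem 10 (values are assumptions).
calves = {10: F(1, 4), 20: F(1, 2), 30: F(1, 8), 40: F(1, 8)}
price = {100: F(2, 5), 150: F(3, 5)}
lhs = mean_of_product_independent(calves, price)
rhs = mean(calves) * mean(price)
print(lhs, rhs)  # equal: the product rule holds under independence

# The counterexample of the text: alpha_1 = alpha_2 = alpha, fully dependent.
alpha = {+1: F(1, 2), -1: F(1, 2)}
mean_of_square = sum(v * v * p for v, p in alpha.items())  # m.v. of alpha_1 * alpha_2
print(mean_of_square, mean(alpha) ** 2)  # 1 0
```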


1.3. Conditional probability 


Two events A and B are called independent if the result of the experiment to
which A is related has no influence on the realization of the experiment with
which B is associated. However, this is by no means always the case. An
example substantiating this statement has been given earlier and will be repeated
here in detail. Suppose that event A consists of drawing a black ball from an urn
containing $m$ black and $n - m$ white balls, and event B of drawing a black ball
from the same urn after one ball has been drawn. It is obvious that, if the first ball
drawn is black, i.e., if A occurs, then after the first draw $m - 1$ black and $n - m$
white balls are left in the urn and, hence, the probability of event B is
$(m - 1)/(n - 1)$. If, however, the first ball drawn is white (namely, the event $\bar{A}$ occurs),
then $m$ black and $n - m - 1$ white balls are left in the urn, and the desired
probability equals $m/(n - 1)$. The probability of event B thus varies according
as A is realized or not, i.e., here the probability of event B can take two different
values, $(m - 1)/(n - 1)$ and $m/(n - 1)$, for which it is necessary also to
prescribe separate notations.

The probability of event B in the case when it is known that event A has
occurred is called the conditional probability of B on the hypothesis that A has
materialized and is written as $p_A(B)$. Thus, in our case $p_A(B) = (m - 1)/(n - 1)$.
Similarly, we define the conditional probability $p_{\bar{A}}(B)$ of B under the assumption
that $\bar{A}$ has occurred (i.e., under the assumption that A has not occurred); in our
case $p_{\bar{A}}(B) = m/(n - 1)$.

It is also obvious that the conditional probability $p_A(B)$ of any event B under
the assumption that A has occurred can be either less or greater than
the unconditional probability $p(B)$ of this event (i.e., the probability of B when
nothing is known about the result of the experiment involving A). Thus, in the
example considered above, it is clear that $p(B) = m/n$, since it is possible to
anticipate beforehand with equal probability that in the second draw any of the
$n$ balls contained in the urn will be drawn, and out of these $n$ balls precisely $m$
are black. Thus, here

$$p_A(B) = \frac{m - 1}{n - 1} < \frac{m}{n} = p(B) \quad\text{and}\quad p_{\bar{A}}(B) = \frac{m}{n - 1} > \frac{m}{n} = p(B).$$


If A and B are independent events, then, obviously, $p_A(B) = p(B)$. The last
specification can be regarded as a precise mathematical definition of the notion
of independence of events and it enables us to verify for any pair of events A
and B whether they are independent or not (see in this context, the specific
example given at the end of this section in small type).

Conditional probabilities can be calculated in much the same way as we computed
unconditional probabilities in Sec. 1. Suppose that an experiment, whose outcome
permits us to determine the occurrence or nonoccurrence of A and also of a
certain other event B, has N equally probable outcomes favourable to A.
Out of these N outcomes, let M be favourable also to B, so that the remaining
$N - M$ are not favourable to B. In this case

$$p_A(B) = \frac{M}{N} \quad \left(\text{and } p_A(\bar{B}) = \frac{N - M}{N}\right).$$


Thus, for instance, in the example examined above, the experiment consisting
of the successive draw of two balls from an urn with $n$ balls has $n(n - 1)$ equally
probable outcomes (in the first draw, any of the $n$ existing balls may be drawn and
in the second, one of the remaining $n - 1$). Out of these $n(n - 1)$ outcomes
there are $N = m(n - 1)$ outcomes favourable to A (the first draw resulting in
one of the $m$ black balls followed by any of the remaining $n - 1$ balls); moreover,
of these $m(n - 1)$ outcomes favourable to A, those favourable to B are $M =
m(m - 1)$ (the first draw resulting in any of the $m$ black balls and the succeeding
one in any of the remaining $m - 1$ black balls). Consequently,

$$p_A(B) = \frac{M}{N} = \frac{m(m - 1)}{m(n - 1)} = \frac{m - 1}{n - 1}.$$


Let us now call K the total number of equally probable outcomes of an
experiment with which is associated the occurrence of two events A and B. Since
out of these K outcomes M are favourable to the occurrence of both A and B,
the probability of the event AB, i.e., of the occurrence of both A and B, equals
$M/K$. However, $M/K = (N/K) \times (M/N)$, but $M/N = p_A(B)$ and $N/K = p(A)$
(because out of K equally probable outcomes, N are favourable to A).
Consequently, we have

$$p(AB) = p(A)\, p_A(B).$$


This is also the general rule for the determination of the probability of the
product AB of two events, usually called the multiplication law of probabilities (the
multiplication law of Sec. 2 being a particular case). Thus, in order to find $p(AB)$,
it is necessary to know the conditional probability $p_A(B)$, which characterizes
the relationship existing between the events A and B. Therefore, the probability
of AB is in general not determined by the two probabilities $p(A)$ and $p(B)$ alone. Only
in the case in which the probability of B is not affected by the occurrence
or nonoccurrence of event A, i.e., in which A and B are independent, do we
have $p_A(B) = p(B)$ and $p(AB) = p(A)\,p(B)$, the conclusion that we obtained
above.
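The counting argument behind the multiplication law can be reproduced by listing all ordered pairs of draws from the urn. The sketch below is illustrative only: the function name `urn_probabilities` and the particular values $m = 3$, $n = 7$ are assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import permutations

def urn_probabilities(m, n):
    """Enumerate the n(n-1) equally probable ordered pairs of draws (needs 1 <= m < n)."""
    balls = ["black"] * m + ["white"] * (n - m)
    draws = list(permutations(range(n), 2))            # K = n(n-1) outcomes
    A = [d for d in draws if balls[d[0]] == "black"]   # first ball black: N outcomes
    AB = [d for d in A if balls[d[1]] == "black"]      # both balls black: M outcomes
    B = [d for d in draws if balls[d[1]] == "black"]   # second ball black
    K = len(draws)
    return {
        "p(A)": Fraction(len(A), K),
        "p(B)": Fraction(len(B), K),
        "p(AB)": Fraction(len(AB), K),
        "p_A(B)": Fraction(len(AB), len(A)),           # M / N
    }

result = urn_probabilities(m=3, n=7)
print(result)  # p(A) = 3/7, p(B) = 3/7, p(AB) = 1/7, p_A(B) = 1/3
```

The printed values satisfy $p(AB) = p(A)\,p_A(B)$, while $p_A(B) = 1/3$ differs from $p(B) = 3/7$, so the two draws are not independent.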

From the definition of conditional probability, we immediately deduce the
following properties:

(a) $0 \le p_A(B) \le 1$; $p_A(B) = 1$, if $A \subset B$ (in particular, if B is the certain
event); $p_A(B) = 0$, if A and B are incompatible (in particular, if B is the
impossible event);

(b) if $B \subset B_1$, then $p_A(B) \le p_A(B_1)$;

(c) if B and C are incompatible, then $p_A(B + C) = p_A(B) + p_A(C)$; if $B_1,
B_2, \ldots, B_k$ are pairwise incompatible, then

$$p_A(B_1 + B_2 + \cdots + B_k) = p_A(B_1) + p_A(B_2) + \cdots + p_A(B_k);$$

(d) $p_A(\bar{B}) = 1 - p_A(B)$.

The proof of these properties is completely analogous to the proof given in
Sec. 2 of these very properties for ordinary (unconditional) probabilities.
We further note that the formula $p(AB) = p(A)\,p_A(B)$ implies

$$\frac{p_B(A)}{p(A)} = \frac{p_A(B)}{p(B)}$$

(since it is obvious that the events AB and BA are identical). This implies,
in particular, that knowing the probabilities $p(A)$ and $p(B)$ of two events A and
B and the conditional probability $p_A(B)$ of B under the assumption that A occurs,
it is possible also to determine the conditional probability $p_B(A)$:

$$p(A)\,p_A(B) = p(B)\,p_B(A), \quad\text{or}\quad p_B(A) = p_A(B) \times \frac{p(A)}{p(B)}.$$

Thus, in the urn example analyzed above, $p(A) = p(B) = m/n$ (the
probabilities of drawing a black ball in the first and in the second draw both equal
$m/n$); hence, $p_B(A) = p_A(B) = (m - 1)/(n - 1)$ (here $p_B(A)$ is the probability of
drawing a black ball in the first draw if it is known that the second draw results
in a black ball).

We finally remark that since one of the events A and $\bar{A}$ necessarily
occurs, the sum of the events $AB$ (i.e., 'B and A') and $\bar{A}B$ ('B and $\bar{A}$') coincides with
the event B. But since

$$p(AB) = p(A)\,p_A(B), \quad p(\bar{A}B) = p(\bar{A})\,p_{\bar{A}}(B),$$

and

$$p(AB + \bar{A}B) = p(AB) + p(\bar{A}B)$$

(the events $AB$ and $\bar{A}B$ are clearly incompatible, because A and $\bar{A}$ are so), then

$$p(B) = p(A)\,p_A(B) + p(\bar{A})\,p_{\bar{A}}(B).$$

Thus in the example under consideration,

$$p(A) = \frac{m}{n}, \quad p(\bar{A}) = \frac{n - m}{n}, \quad p_A(B) = \frac{m - 1}{n - 1}, \quad p_{\bar{A}}(B) = \frac{m}{n - 1},$$

and

$$p(A)\,p_A(B) + p(\bar{A})\,p_{\bar{A}}(B) = \frac{m}{n} \cdot \frac{m - 1}{n - 1} + \frac{n - m}{n} \cdot \frac{m}{n - 1} = \frac{m}{n} \quad (= p(B)).$$

Quite similarly, if any experiment $\alpha$ can have $k$ (and only $k$) pairwise
incompatible outcomes $A_1, A_2, \ldots, A_k$, then every event B can be expressed as the sum
of events $A_1B + A_2B + \cdots + A_kB$. Hence

$$p(B) = p(A_1)\,p_{A_1}(B) + p(A_2)\,p_{A_2}(B) + \cdots + p(A_k)\,p_{A_k}(B).$$

This equation is called the equation of total probability.


Problem 11. There are three urns: urn 1 contains 2 white and 4 black balls, urn
2 contains 4 white and 2 black balls and urn 3 contains 3 white and 3 black balls.
A ball is drawn at random from an urn (which urn is not known). What is the
probability that the selected ball is from the first urn if it turns out to be (a) white,
(b) black?

Let event A (resp. event $\bar{A}$) be that the selected ball is white (resp.
black). Moreover, let event B be that the ball is drawn from the first urn.
Our experiment of drawing a single ball can have $3 \times 6 = 18$ outcomes (according
to the total number of balls in all three of the urns), which we regard as
equally probable (in other words, the drawing of a ball from any of the urns is
considered to be equally probable). Of these 18 outcomes, 9 are favourable to
A, and of these 9 outcomes 2 are favourable to B. Of the same 18 outcomes, 9
are favourable to $\bar{A}$, too, but of these 9 outcomes 4 are favourable to B. Thus,
we have

$$p_A(B) = \frac{2}{9} \quad\text{and}\quad p_{\bar{A}}(B) = \frac{4}{9}.$$
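The count of favourable outcomes in Problem 11 is short enough to write out in code. This is a minimal sketch: the dictionary layout and the function name `prob_first_urn_given` are illustrative assumptions, and the method relies, as the text does, on every urn holding the same number of balls, so that all 18 balls are equally probable.

```python
from fractions import Fraction

# Contents of the three urns of Problem 11.
urns = {
    1: {"white": 2, "black": 4},
    2: {"white": 4, "black": 2},
    3: {"white": 3, "black": 3},
}

def prob_first_urn_given(colour):
    """Conditional probability that the ball came from urn 1, given its colour.

    Valid because each urn holds 6 balls, so each of the 18 balls is equally probable.
    """
    favourable_to_colour = sum(urn[colour] for urn in urns.values())  # N
    favourable_to_both = urns[1][colour]                              # M
    return Fraction(favourable_to_both, favourable_to_colour)         # M / N

print(prob_first_urn_given("white"))  # 2/9
print(prob_first_urn_given("black"))  # 4/9
```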


Problem 12. The word 'papagay' is formed from cards bearing its letters.
The cards are then well mixed, and four of them are drawn one
after another and arranged in a row. What is the probability of
obtaining the word 'papa' by this procedure?

Let events A, B, C and D be, respectively, that the first letter drawn
is 'p'; the second 'a'; the third 'p' and the fourth 'a'; then the event in whose
probability we are interested can be written as ABCD. Further, by applying
the formula for the probability of the product of two events several times in
succession, we have

$$p(A) = \frac{2}{7},$$

$$p(AB) = p(A)\,p_A(B) = \frac{2}{7} \times \frac{3}{6} = \frac{1}{7},$$

$$p(ABC) = p(AB)\,p_{AB}(C) = \frac{1}{7} \times \frac{1}{5} = \frac{1}{35},$$

and, finally,

$$p(ABCD) = p(ABC)\,p_{ABC}(D) = \frac{1}{35} \times \frac{2}{4} = \frac{1}{70}.$$
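The result of Problem 12 can be confirmed by brute force, since there are only $7 \times 6 \times 5 \times 4 = 840$ ordered ways of drawing four of the seven cards. The following sketch (variable names are illustrative) counts those that spell 'papa' and compares the outcome with the chain of conditional probabilities.

```python
from fractions import Fraction
from itertools import permutations

cards = list("papagay")  # seven cards: two 'p', three 'a', one 'g', one 'y'

# Every ordered selection of 4 distinct cards is equally probable.
draws = list(permutations(range(len(cards)), 4))
hits = sum(1 for d in draws if "".join(cards[i] for i in d) == "papa")
by_counting = Fraction(hits, len(draws))

# Repeated use of the multiplication law p(AB) = p(A) p_A(B).
by_chain = Fraction(2, 7) * Fraction(3, 6) * Fraction(1, 5) * Fraction(2, 4)

print(hits, len(draws))       # 12 840
print(by_counting, by_chain)  # 1/70 1/70
```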


Problem 13. We have 5 urns, of which two urns each contain 1 white and 5
black balls; one urn contains 2 white and 5 black balls, and, finally, each of the last
two urns contains 3 white and 5 black balls. An urn is chosen at random and a
ball is drawn at random from it. What is the probability that the selected ball is
white?

We denote by $A_1$, $A_2$ and $A_3$ the events that the ball is drawn from an
urn containing, respectively, one, two, or three white balls. Then,

$$p(A_1) = \frac{2}{5}, \quad p(A_2) = \frac{1}{5} \quad\text{and}\quad p(A_3) = \frac{2}{5}.$$

Further, if B is the event that the selected ball is white, then by the
equation of total probability, we have

$$p(B) = p(A_1) \times p_{A_1}(B) + p(A_2) \times p_{A_2}(B) + p(A_3) \times p_{A_3}(B)
= \frac{2}{5} \times \frac{1}{6} + \frac{1}{5} \times \frac{2}{7} + \frac{2}{5} \times \frac{3}{8} = \frac{23}{84}.$$
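The equation of total probability lends itself to a direct computation. The sketch below evaluates it for the urns of Problem 13 with exact fractions; the tuple layout `(number of urns, white, black)` is an illustrative choice, not notation from the text.

```python
from fractions import Fraction as F

# (number of urns of this kind, white balls, black balls), as in Problem 13.
urn_types = [(2, 1, 5), (1, 2, 5), (2, 3, 5)]
n_urns = sum(count for count, _, _ in urn_types)

# p(B) = sum over k of p(A_k) * p_{A_k}(B)
p_white = sum(F(count, n_urns) * F(white, white + black)
              for count, white, black in urn_types)

print(p_white)  # 23/84
```

Note that here the urns hold different numbers of balls, so the individual balls are not equally probable and simple counting of white balls would give a wrong answer.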


We conclude with a simple example to demonstrate the application of the definition of
independent random events given on p. 21. We consider a regular tetrahedron of homogeneous
material, with the digits 1, 2 and 3 inscribed on three of its faces and all these digits together on
the fourth face (see Figure 2). Let A, B and C denote the events that the throw of the
tetrahedron results in showing a face bearing the digit 1, 2 and 3, respectively. It is, thus, obvious that
$p(A) = p(B) = p(C) = \frac{1}{2}$. Indeed, the tetrahedron may fall on any one of its faces with the
same probability and each of the digits appears on precisely two of its four faces.

Fig. 2.

If it is now known that event A has occurred, then this signifies the appearance of the face
inscribed either with the digit 1 alone or with the three digits 1, 2 and 3. Both the events
B and C are realized in the latter but not in the former case. Consequently, $p_A(B) = p_A(C) = \frac{1}{2}$,
so that

$$p_A(B) = p(B) \quad\text{and}\quad p_A(C) = p(C).$$

Hence both A and B, and A and C are independent, which yields also

$$p(AB) = p(A)\,p(B) = \frac{1}{4}, \quad p(AC) = p(A)\,p(C) = \frac{1}{4}$$

(see the multiplication law of probabilities for independent events on page 11). Similarly we can
verify that the events B and C are also independent, i.e., here too we have $p_B(C) = p(C) = \frac{1}{2}$.

From the example adduced, we can also infer that pairwise independence of the
events A, B and C does not imply the independence of all three of them taken
together, i.e., the validity of the equation

$$p(ABC) = p(A)\,p(B)\,p(C)$$

(cf. p. 11). It is, in fact, obvious that in our example the simultaneous occurrence of A and
B implies also the occurrence of C, so that here we have

$$p_{AB}(C) = 1 \quad\text{and}\quad p(ABC) = p(AB)\,p_{AB}(C) = \frac{1}{4} \cdot 1 = \frac{1}{4},$$

while

$$p(A)\,p(B)\,p(C) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}.$$
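The tetrahedron example is small enough to be checked exhaustively. In the sketch below each face is represented by the set of digits inscribed on it; the helper `p` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction
from itertools import combinations

# The four equally probable faces, each described by the digits it carries.
faces = [{1}, {2}, {3}, {1, 2, 3}]

def p(*digits):
    """Probability that the face shown carries all of the given digits."""
    favourable = [face for face in faces if set(digits) <= face]
    return Fraction(len(favourable), len(faces))

# Events A, B, C correspond to the digits 1, 2, 3.
pairwise = all(p(x, y) == p(x) * p(y) for x, y in combinations((1, 2, 3), 2))
all_three = p(1, 2, 3) == p(1) * p(2) * p(3)

print(pairwise)                       # True: every pair of events is independent
print(all_three)                      # False
print(p(1, 2, 3), p(1) * p(2) * p(3)) # 1/4 1/8
```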


1.4. The variance of a random variable. Chebyshev’s inequality and the
law of large numbers


A very important characteristic of a random variable is, of course, its average
(mean) value. With the aid of mean values, we can compare two random
variables; thus, for example, between the two marksmen (see Problem 4, p. 7) the
better shot is naturally the one who scores a higher mean number of points.
There are, however, many problems where knowledge of merely the mean value
of a random variable supplies very scanty information about the variable. We
consider, for example, a cannon aimed to hit a target clamped into a vise at a
distance $a$ km from the cannon (Fig. 3). If we denote by $\alpha$ (km) the firing range
of the shell, then the mean value of $\alpha$, as a rule, equals $a$; a deviation of the
average value from $a$ testifies to the presence of a systematic error in firing
(systematic flight of the shell beyond, or short of, the target), which
can be eliminated by suitably changing the inclination of the barrel of the
cannon. However, the absence of a systematic error does not at all guarantee high
accuracy in firing. To evaluate accuracy, it is also necessary to know how close
the shells come to hitting the target (since the equation m.v. $\alpha = a$ only signifies
that the shell on the average overshoots the target as often as it falls short of it).


Fig. 3. 


How do we determine the accuracy in firing (and compare the performance
of two cannons aimed at a target)? The deviation of a shell from the target is
given by the number $\alpha - a$; however, the mean value of the random variable
$\alpha - a$ is evidently zero:

$$\text{m.v. }(\alpha - a) = \text{m.v. }\alpha - a = a - a = 0,$$

which is obvious, since the positive and negative values of $\alpha - a$ cancel one
another on the average. It is plain that a nice characteristic of the 'spread' is the mean value of
$|\alpha - a|$ (where the vertical lines denote, as usual, the absolute value of a
number); however, mathematicians do not have much liking for the absolute value
of numbers, since it is of little use for further algebraic transformation. Hence,
it is usual to characterize the spread of a random variable as the mean value of the
square of its deviation from its mean value: in fact, the square of both positive and
negative numbers is always positive, and no cancellation of the deviations occurs
here. The number thus obtained is called the variance of the random variable
$\alpha$:

$$\text{Var. }\alpha = \text{m.v. }(\alpha - a)^2 \quad (= \text{m.v. }(\alpha - \text{m.v. }\alpha)^2).$$

The variance of $\alpha$ is the most commonly used measure of the spread or
dispersion (or deviation from the mean value)† of $\alpha$. It is obvious that, in the case of a
cannon aimed to strike a target, we consider that cannon to be most appropriate
for which the variance of $\alpha$, the range of flight of the shell, is least (here it is
assumed that the cannon is so regulated that the average range of flight of the shell
coincides with the distance $a$ from the cannon to the target).

It is easy to comprehend that for the random variable $\alpha$, characterized by the
accompanying probability table

$$\begin{array}{l|cccc}
\text{Value} & a_1 & a_2 & \cdots & a_k \\ \hline
\text{Probability} & p_1 & p_2 & \cdots & p_k
\end{array}$$

the mean value $a$ is given by

$$a = \text{m.v. }\alpha = p_1a_1 + p_2a_2 + \cdots + p_ka_k$$

(see, above, p. 6), and the variance is defined by

$$\text{Var. }\alpha = \text{m.v. }(\alpha - a)^2 = p_1(a_1 - a)^2 + p_2(a_2 - a)^2 + \cdots + p_k(a_k - a)^2.$$


†It is obvious that if, as in the above example, the random variable $\alpha$ has km as its unit of
measurement, then its mean value is also measured in km and its variance in km². Hence, along with
the variance we frequently consider also the number which is the square root of the variance of a
random variable. This number is called the standard deviation of the random variable:

$$\text{standard deviation of } \alpha = \sqrt{\text{Var. }\alpha};$$

it is measured in the same units as the random variable $\alpha$ and also serves as a measure of the
spread of its values.

The last equation can also be set up in a somewhat different form. We note that

$$(\alpha - a)^2 = \alpha^2 - 2a\alpha + a^2.$$

Hence, since the mean value of a sum of (random) variables is the sum of their
mean values (see p. 18),

$$\begin{aligned}
\text{Var. }\alpha = \text{m.v. }(\alpha - a)^2 &= \text{m.v. }(\alpha^2 - 2a\alpha + a^2) \\
&= \text{m.v. }\alpha^2 + \text{m.v. }(-2a\alpha) + \text{m.v. }a^2.
\end{aligned}$$

However, $a^2$ is not a random variable but a number having a completely definite
value‡, hence

$$\text{m.v. }a^2 = a^2.$$

On the other hand, the random variable $-2a\alpha$ is obtained from the random
variable $\alpha$ by multiplying all its values by $-2a$; hence its mean value also is
obtained by multiplying the mean value of $\alpha$ by $-2a$:

$$\text{m.v. }(-2a\alpha) = -2a \times \text{m.v. }\alpha = -2a \times a = -2a^2.$$

Thus, we finally get

$$\begin{aligned}
\text{Var. }\alpha &= \text{m.v. }\alpha^2 + \text{m.v. }(-2a\alpha) + \text{m.v. }a^2 = \text{m.v. }\alpha^2 - 2a^2 + a^2 \\
&= \text{m.v. }\alpha^2 - a^2 = \text{m.v. }(\alpha^2) - (\text{m.v. }\alpha)^2,
\end{aligned}$$

i.e., the variance of a random variable is equal to the mean value of its square
minus the square of its mean value. But the variance of a random variable is
always non-negative (for this is the mean value of the variable $(\alpha - a)^2$, all of
whose values are non-negative). It follows from this that the mean value of the
square of a random variable is never less than the square of its mean value (see
p. 20).


‡By $a^2$ we can of course understand a 'random variable' with the accompanying probability
table

Values           $a^2$
Probabilities    1

which implies that

$$\text{m.v. }a^2 = 1 \times a^2 = a^2,$$

i.e., the mean value of a constant is equal to that constant.

Problem 14. The accompanying tables of probabilities (frequencies) of the number
of defective items (per thousand) are assigned to two identical lathes:

First lathe:
Number of defective items (per thousand)    0      1      2      3      4
Probabilities                               0.1    0.2    0.4    0.2    0.1

Second lathe:
Number of defective items (per thousand)    0      1      2      3      4
Probabilities                               0.15   0.2    0.25   0.3    0.1

Compare the mean numbers of defective items produced by the first and the second
lathe and the variances of these variables.

It is easy to see that the mean numbers of defective items produced by the first
lathe ($\alpha$) and by the second lathe ($\beta$) are identical:

$$\text{m.v. }\alpha = 0.1 \times 0 + 0.2 \times 1 + 0.4 \times 2 + 0.2 \times 3 + 0.1 \times 4 = 2,$$

and

$$\text{m.v. }\beta = 0.15 \times 0 + 0.2 \times 1 + 0.25 \times 2 + 0.3 \times 3 + 0.1 \times 4 = 2.$$

From this aspect, both lathes can be regarded as equivalent. But the variance
of the variable $\alpha$ is less than that of $\beta$:

$$\begin{aligned}
\text{Var. }\alpha ={}& 0.1 \times (0 - 2)^2 + 0.2 \times (1 - 2)^2 + 0.4 \times (2 - 2)^2 \\
&+ 0.2 \times (3 - 2)^2 + 0.1 \times (4 - 2)^2 = 1.2,
\end{aligned}$$

and

$$\begin{aligned}
\text{Var. }\beta ={}& 0.15 \times (0 - 2)^2 + 0.2 \times (1 - 2)^2 + 0.25 \times (2 - 2)^2 \\
&+ 0.3 \times (3 - 2)^2 + 0.1 \times (4 - 2)^2 = 1.5.
\end{aligned}$$

This signifies that production by the first lathe is more 'stable', because here the
number of defective items produced in different lots of a thousand items is more
densely clustered around the mean value 2 than in the case of the second lathe.
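The comparison of the two lathes can be repeated in code, using both expressions for the variance: the definition as the mean squared deviation, and the mean of the square minus the square of the mean. The function names below are illustrative assumptions.

```python
# Probability tables of Problem 14: number of defective items -> probability.
first = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.2, 4: 0.1}
second = {0: 0.15, 1: 0.2, 2: 0.25, 3: 0.3, 4: 0.1}

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    """Definition: mean value of the squared deviation from the mean."""
    a = mean(dist)
    return sum(prob * (value - a) ** 2 for value, prob in dist.items())

def variance_via_squares(dist):
    """Equivalent form: m.v. of the square minus the square of the m.v."""
    return sum(prob * value ** 2 for value, prob in dist.items()) - mean(dist) ** 2

for name, dist in (("first", first), ("second", second)):
    print(name, mean(dist), variance(dist), variance_via_squares(dist))
```

Up to floating-point rounding, both lathes show the mean 2, while the variances come out as 1.2 and 1.5 respectively, whichever of the two formulas is used.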


We now note that the variance of the sum of two independent random variables
is always equal to the sum of their variances. In fact, suppose that $\alpha$ and $\beta$ are
two independent random variables, i.e., such that the probability of an individual
outcome of one does not at all depend on the particular value taken in that
experiment by the other. In this case, as we know (see pp. 15-19), if

$$\text{m.v. }\alpha = a \text{ and } \text{m.v. }\beta = b, \text{ then } \text{m.v. }(\alpha + \beta) = a + b \text{ and } \text{m.v. }(\alpha\beta) = ab.$$

Together with $\alpha$ and $\beta$, we consider two more random variables $\alpha^2$ and $\beta^2$, whose
values are the squares of the values of $\alpha$ and $\beta$; for them also we have

$$\text{m.v. }(\alpha^2 + \beta^2) = \text{m.v. }\alpha^2 + \text{m.v. }\beta^2.$$

Further,

$$\text{Var. }\alpha = \text{m.v. }\alpha^2 - a^2; \quad \text{Var. }\beta = \text{m.v. }\beta^2 - b^2,$$

and

$$\begin{aligned}
\text{Var. }(\alpha + \beta) &= \text{m.v. }(\alpha + \beta)^2 - [\text{m.v. }(\alpha + \beta)]^2 = \text{m.v. }(\alpha + \beta)^2 - (a + b)^2 \\
&= \text{m.v. }(\alpha^2 + 2\alpha\beta + \beta^2) - (a^2 + 2ab + b^2).
\end{aligned}$$

However, since the mean value of a sum of random variables is the sum of their
mean values, we have

$$\text{m.v. }(\alpha^2 + 2\alpha\beta + \beta^2) = \text{m.v. }\alpha^2 + \text{m.v. }(2\alpha\beta) + \text{m.v. }\beta^2.$$

But since the random variable $2\alpha\beta$ is twice the random variable $\alpha\beta$, it follows
that

$$\text{m.v. }(2\alpha\beta) = 2\,\text{m.v. }(\alpha\beta) = 2ab.$$

Thus, we finally obtain

$$\begin{aligned}
\text{Var. }(\alpha + \beta) &= (\text{m.v. }\alpha^2 + 2ab + \text{m.v. }\beta^2) - (a^2 + 2ab + b^2) \\
&= (\text{m.v. }\alpha^2 + \text{m.v. }\beta^2) - (a^2 + b^2) \\
&= (\text{m.v. }\alpha^2 - a^2) + (\text{m.v. }\beta^2 - b^2) = \text{Var. }\alpha + \text{Var. }\beta.
\end{aligned}$$


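It is obvious that also for an arbitrary number of pairwise independent random
variables, the variance of their sum is equal to the sum of their variances.
However, for non-independent random variables, this is no longer so. Suppose, for
example, that $\alpha_1$ and $\alpha_2$ are one and the same random variable $\alpha$ with the mean
value $a$; then $\alpha_1 + \alpha_2 = 2\alpha$. In this case, obviously,

$$\text{m.v. }(2\alpha) = 2\,\text{m.v. }\alpha \quad (\text{i.e., } \text{m.v. }(\alpha_1 + \alpha_2) = \text{m.v. }\alpha_1 + \text{m.v. }\alpha_2).$$

However,

$$\text{Var. }(2\alpha) = 4\,\text{Var. }\alpha \quad (\text{i.e., } \text{Var. }(\alpha_1 + \alpha_2) = 2\,\text{Var. }\alpha_1 + 2\,\text{Var. }\alpha_2),$$

since

$$\begin{aligned}
\text{Var. }(2\alpha) &= \text{m.v. }[2\alpha - \text{m.v. }(2\alpha)]^2 = \text{m.v. }(2\alpha - 2a)^2 = \text{m.v. }[4(\alpha - a)^2] \\
&= 4\,\text{m.v. }(\alpha - a)^2 = 4\,\text{Var. }\alpha.
\end{aligned}$$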


Problem 15. A firm manufactures some items, each individual item having a
definite probability $p$ of being defective (say, $p = 0.002 = 0.2$ per cent).
Assuming that the items in some lot of a thousand are found defective with
probability $p$ independently of each other, find the mean value of the number of
defective items per 1000 items produced and the variance of this variable.

We denote by $\alpha_i$ (where $i = 1, 2, 3, \ldots,$ or 1000) a random variable which
assumes the value 1 or 0 according as the $i$th item is or is not found
defective; in such a case all the 1000 variables have one and the same probability
table:

Values         1      0
Probability    $p$    $1 - p$

Hence,

$$\text{m.v. }\alpha_i = p \times 1 + (1 - p) \times 0 = p \ (= 0.002),$$

and

$$\begin{aligned}
\text{Var. }\alpha_i &= \text{m.v. }\alpha_i^2 - (\text{m.v. }\alpha_i)^2 = [p \times 1^2 + (1 - p) \times 0^2] - p^2 = p - p^2 \\
&= p(1 - p) \ (= 0.002 \times 0.998 = 0.001996).
\end{aligned}$$

However, the variable $\alpha$ we are interested in is here the sum of all the variables
$\alpha_i$, i.e.,

$$\alpha = \alpha_1 + \alpha_2 + \alpha_3 + \cdots + \alpha_{1000}.$$

Moreover, all the variables $\alpha_i$ are mutually independent by hypothesis. Hence,

$$\text{m.v. }\alpha = \text{m.v. }\alpha_1 + \text{m.v. }\alpha_2 + \cdots + \text{m.v. }\alpha_{1000} = 1000p \ (= 2),$$

and

$$\text{Var. }\alpha = \text{Var. }\alpha_1 + \text{Var. }\alpha_2 + \cdots + \text{Var. }\alpha_{1000} = 1000p(1 - p) \ (= 1.996).$$


The solution of the problem in hand makes use of the fact that the mean value
and the variance of a sum of $n$ mutually independent random variables $\alpha_1, \alpha_2, \ldots, \alpha_n$
with the same mean value $a$ and the same variance $d$ are, respectively, equal to
$n$ times the mean value and the variance of a single variable $\alpha_1$:

$$\text{m.v. }(\alpha_1 + \alpha_2 + \cdots + \alpha_n) = n\,\text{m.v. }\alpha_1 = na,$$

and

$$\text{Var. }(\alpha_1 + \alpha_2 + \cdots + \alpha_n) = n\,\text{Var. }\alpha_1 = nd.$$

In particular, if $\alpha$ is the number of occurrences of a certain event A in a
sequence of $n$ mutually independent trials, the probability of the occurrence of A
in each trial being $p$, then

$$\text{m.v. }\alpha = np \quad\text{and}\quad \text{Var. }\alpha = np(1 - p).$$
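These two formulas can be cross-checked against a direct computation from the probability table of $\alpha$. The sketch below relies on an assumption that is not derived in this section: the standard binomial formula, by which the probability of exactly $j$ occurrences in $n$ independent trials is $\binom{n}{j} p^j (1 - p)^{n - j}$. The function name is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_mean_and_variance(n, p):
    """Mean and variance of the number of occurrences, computed from the full table.

    Assumes the binomial formula for the probability of exactly j occurrences.
    """
    table = [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]
    mean = sum(j * w for j, w in enumerate(table))
    var = sum(w * (j - mean) ** 2 for j, w in enumerate(table))
    return mean, var

# Data of Problem 15: n = 1000 items, p = 0.002, kept as an exact fraction.
n, p = 1000, Fraction(2, 1000)
mean, var = binomial_mean_and_variance(n, p)
print(mean, var)                  # 2 499/250
print(n * p, n * p * (1 - p))     # the same two numbers from np and np(1 - p)
```

Since 499/250 = 1.996, the exact computation agrees with the values obtained in Problem 15.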


From what has been stated above, there emerges a consequence that is quite
often useful. We consider the arithmetic mean

$$\alpha_m = \frac{\alpha_1 + \alpha_2 + \cdots + \alpha_n}{n}$$

of $n$ mutually independent random variables with the same mean value $a$ and the
same variance $d$. Since all values of the variable $\alpha_m$ are $1/n$ times the corresponding
values of the variable $\alpha_1 + \alpha_2 + \cdots + \alpha_n$, the mean value of $\alpha_m$ is also $1/n$
times the mean value of the sum $\alpha_1 + \alpha_2 + \cdots + \alpha_n$, i.e.,

$$\mathrm{m.v.}\,\alpha_m = \frac{1}{n}\,(na) = a.$$

The variance of the random variable $\alpha_m$ is, however, $1/n^2$ times the variance
of the variable $\alpha_1 + \alpha_2 + \cdots + \alpha_n$ (cf. the assertions above on p. 30 about the
variances of $\alpha$ and $2\alpha$); hence,

$$\mathrm{Var}\,\alpha_m = \frac{nd}{n^2} = \frac{d}{n}.$$

Thus, the mean value of the arithmetic mean of $n$ mutually independent random
variables with the same mean value and variance is equal to the mean value of
each of these variables; the variance of the arithmetic mean is, however, $n$ times
smaller than the variance of each of the random variables under consideration.
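The statement about the arithmetic mean can be verified exactly on a small case. The sketch below builds the probability table of the arithmetic mean of four independent throws of a fair die (the die and the helper names `mean_and_variance` and `table_of_arithmetic_mean` are illustrative choices, not taken from the text) and compares its mean value and variance with those of a single throw, using exact fractions.

```python
from fractions import Fraction
from itertools import product


def mean_and_variance(table):
    """table maps each value of a random variable to its probability."""
    m = sum(v * p for v, p in table.items())
    var = sum((v - m) ** 2 * p for v, p in table.items())
    return m, var


def table_of_arithmetic_mean(single, n):
    """Probability table of the arithmetic mean of n independent variables,
    each having the probability table `single`."""
    result = {}
    for outcome in product(single.items(), repeat=n):
        value = sum(v for v, _ in outcome) / n
        prob = Fraction(1)
        for _, p in outcome:
            prob *= p            # independence: probabilities multiply
        result[value] = result.get(value, 0) + prob
    return result


die = {Fraction(v): Fraction(1, 6) for v in range(1, 7)}
a, d = mean_and_variance(die)
a_m, d_m = mean_and_variance(table_of_arithmetic_mean(die, 4))
print(a, d, a_m, d_m)   # a_m equals a, and d_m equals d / 4
```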

The inference deduced may now be illustrated by an example. Suppose that
we are called upon to determine, to the greatest possible accuracy, the value of
some physical quantity $a$ (for concreteness, we may imagine that the question
is, say, the determination of some distance on a plane). The result $\alpha$ of a single
measurement of the quantity $a$ may be regarded as a random variable, since there
always exists a definite probability of error due to the inaccuracy of the measuring
instrument and carelessness in measurement; in this case, the absence of a
systematic error in measurement means that

$$\mathrm{m.v.}\,\alpha = a$$

(cf. p. 26). We now carry out, say, 20 independent measurements and form the
arithmetic mean $\alpha_m$ of the results $\alpha_1, \alpha_2, \ldots, \alpha_{20}$ of these measurements. In
this case

$$\mathrm{m.v.}\,\alpha_m = \mathrm{m.v.}\,\alpha = a,$$

i.e., the values of the variable $\alpha_m$, the same as those of $\alpha$, cluster around the true
value $a$ of the measured quantity. However, since

$$\mathrm{Var}\,\alpha_m = \frac{1}{20}\,\mathrm{Var}\,\alpha,$$

the spread of the values of $\alpha_m$ is considerably less than that of the values of $\alpha$.
Hence, taking $a$ to be equal to the value of $\alpha_m$, we are fully justified in expecting that a
large error is now considerably less probable than in the case in which $a$ is taken
to be the result $\alpha$ of a single measurement. Thus, for instance, if we measure
on a plane a distance of the order of 100 m, an error of 1-2 m is often
quite possible; however, the arithmetic mean of 20 independent measurements
here almost always deviates from the true value by considerably less than
1 m.

The last remark leads us straight to a remarkable inequality whose derivation
is the principal aim of this section. Since $\mathrm{Var}\,\alpha_m < \mathrm{Var}\,\alpha$, we may expect that
the probability of a considerable deviation of the variable $\alpha_m$ from its mean
value $a$ is less than the probability of a large deviation of $\alpha$ from the number
$a = \mathrm{m.v.}\,\alpha$. This conclusion can be justified rigorously on the basis of the
following fundamental result: if $\alpha$ is a random variable with mean value $a$ and
variance $d$, then we always have

$$P(|\alpha - a| > \varepsilon) \le \frac{d}{\varepsilon^2}. \qquad (*)$$

Here $\varepsilon$ is an arbitrary positive number and the expression $P(|\alpha - a| > \varepsilon)$
denotes the probability that the deviation of the value of $\alpha$ from its mean value
$a$ is greater than $\varepsilon$. The inequality (*) is called Chebyshev’s inequality; it shows
that the less the variance $d$ of the random variable $\alpha$, the less is the probability
of a large deviation of $\alpha$ from the number $a = \mathrm{m.v.}\,\alpha$.
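To see how the inequality (*) behaves on a concrete probability table, the sketch below compares the exact probability of a deviation with the bound $d/\varepsilon^2$ for a fair die. The die and the name `deviation_probability` are illustrative assumptions; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Probability table of a fair die (an illustrative choice)
table = {v: Fraction(1, 6) for v in range(1, 7)}

a = sum(v * p for v, p in table.items())              # mean value
d = sum((v - a) ** 2 * p for v, p in table.items())   # variance


def deviation_probability(eps):
    """Exact value of P(|alpha - a| > eps) for the table above."""
    return sum(p for v, p in table.items() if abs(v - a) > eps)


for eps in (1, 2, Fraction(5, 2)):
    # each line: eps, exact probability, Chebyshev bound d / eps^2
    print(eps, deviation_probability(eps), d / eps ** 2)
```

In each line the exact probability stays below the bound, often by a wide margin.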

Chebyshev’s inequality (*) is a particular case of another inequality (this also
is usually called Chebyshev’s inequality), involving an arbitrary random variable
$\beta$ which takes only non-negative values. To be precise, if $\beta$ takes only non-negative
values and the mean value of $\beta$ is $b$ then, no matter what the positive number $c$,

$$P(\beta > c) \le \frac{b}{c}, \qquad (**)$$

where $P(\beta > c)$ is the probability that $\beta$ takes a value greater than $c$. The
inequality (*) obviously follows from (**). In fact, it is only necessary to choose
as $\beta$ the non-negative random variable $(\alpha - a)^2$ whose mean value, by definition,
is the variance $d$ of the variable $\alpha$, and to note that the condition $|\alpha - a| > \varepsilon$
is equivalent to the condition $(\alpha - a)^2 > \varepsilon^2$; then (**) transforms into (*).
Hence, it will suffice for us to prove (**).

Let us assume that the probability table for $\beta$ has the form

Values          $b_1$    $b_2$    $b_3$    $\ldots$    $b_n$
Probabilities   $p_1$    $p_2$    $p_3$    $\ldots$    $p_n$

In this case,

$$b = \mathrm{m.v.}\,\beta = p_1 b_1 + p_2 b_2 + p_3 b_3 + \cdots + p_n b_n.$$

Assume that the possible values of the variable $\beta$ listed in the above table are
numbered in ascending order, so that $b_1 < b_2 < b_3 < \cdots < b_n$. Of these,
suppose that $b_k$ is the first value that exceeds $c$ (i.e., the values $b_1, b_2, \ldots, b_{k-1}$
are all less than or equal to $c$ and $b_k, b_{k+1}, \ldots, b_n$ are all greater than $c$); since
all values of $\beta$ are non-negative, the sum on the right-hand side of the preceding
equation for $b$ cannot be enlarged if the summands $p_1 b_1 + p_2 b_2 + \cdots + p_{k-1} b_{k-1}$
in it are discarded. Consequently,

$$b \ge p_k b_k + p_{k+1} b_{k+1} + \cdots + p_n b_n.$$

For all values $b_k, b_{k+1}, \ldots, b_n$ on the right-hand side of the last inequality,
we now substitute the number $c$, which is less than these values. Our sum will
then get further reduced, and hence

$$b \ge p_k c + p_{k+1} c + \cdots + p_n c = (p_k + p_{k+1} + \cdots + p_n)\,c.$$

Thus, we arrive at the inequality

$$p_k + p_{k+1} + \cdots + p_n \le \frac{b}{c},$$

which precisely coincides with the inequality (**), since the sum $p_k + p_{k+1} + \cdots + p_n$
of the probabilities of those values of $\beta$ which exceed $c$ is
exactly equal to $P(\beta > c)$.
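The steps of this proof can be traced on a small probability table. The values, probabilities and the number $c$ below are invented for illustration; the sketch checks the chain of inequalities used in the derivation with exact fractions.

```python
from fractions import Fraction

# An invented probability table for a non-negative variable beta
values = [0, 1, 2, 5, 10]                     # b_1 < b_2 < ... < b_n
probs = [Fraction(3, 10), Fraction(3, 10), Fraction(2, 10),
         Fraction(1, 10), Fraction(1, 10)]    # p_1, ..., p_n (sum to 1)
c = 4

b = sum(p * v for v, p in zip(values, probs))             # mean value of beta
tail = [(v, p) for v, p in zip(values, probs) if v > c]   # values exceeding c
tail_sum = sum(p * v for v, p in tail)    # p_k b_k + ... + p_n b_n
tail_prob = sum(p for _, p in tail)       # P(beta > c)

# discard the first summands, then replace each remaining b_i by c
assert b >= tail_sum >= c * tail_prob
# the inequality (**)
assert tail_prob <= b / c
print(b, tail_sum, c * tail_prob, tail_prob, b / c)
```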


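We now recall the random variable $\alpha_m$, the arithmetic mean of $n$ independent
random variables $\alpha_1, \alpha_2, \ldots, \alpha_n$ with exactly the same mean value $a$ and
variance $d$:

$$\alpha_m = \frac{\alpha_1 + \alpha_2 + \cdots + \alpha_n}{n}.$$

It is seen from above that

$$\mathrm{m.v.}\,\alpha_m = a, \qquad \mathrm{Var}\,\alpha_m = \frac{d}{n}.$$

Now applying Chebyshev’s inequality (*) to the variable $\alpha_m$, we obtain

$$P(|\alpha_m - a| > \varepsilon) \le \frac{d}{n\varepsilon^2}. \qquad (***)$$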


Thus, for instance, suppose that we are given 20 independent measurements
of a distance of about 100 m (such that the average value $a$ of the result of each of these
measurements also equals 100 m). Assume that the variance of each measurement
is close to 2 m$^2$. In other words, it is presumed that the squared error
of each measurement is on the average equal to 2, i.e., the absolute value of the
error of each measurement is usually of order 1-2 m. In this case, the formula
(***) with $\varepsilon = 1$ m yields

$$P(|\alpha_m - 100| > 1) \le \frac{2}{20 \times 1^2} = 0.1.$$

Thus, the probability that the arithmetic mean of these 20 measurements deviates
from the true value of the distance by more than 1 m is here necessarily at most
0.1.†
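The arithmetic of this example is easily reproduced. The sketch below evaluates the right-hand side of (***) for $n = 20$, $d = 2$ and $\varepsilon = 1$. For comparison only, it also evaluates the same probability under the additional assumption, not made in the text, that the arithmetic mean is normally distributed with variance $d/n$; this shows how far the Chebyshev bound may lie above the actual probability.

```python
import math

n, d, eps = 20, 2.0, 1.0

# Right-hand side of inequality (***)
chebyshev_bound = d / (n * eps ** 2)

# Extra assumption (NOT part of the text): the arithmetic mean is normal
# with standard deviation sqrt(d / n); then P(|X - a| > eps) = erfc(z / sqrt(2))
sigma_m = math.sqrt(d / n)
normal_tail = math.erfc(eps / (sigma_m * math.sqrt(2)))

print(chebyshev_bound, normal_tail)
```

Under the normality assumption the probability comes out well below 0.002, far smaller than the bound of 0.1.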

†It should also be kept in view that Chebyshev’s inequality (*), as well as the related
inequality (***), is rather crude: the real value of the probability on the left-hand side of
these inequalities is most often much less than the corresponding right-hand side. Thus, for
example, applying more complex methods, it can be shown that, in the example considered
by us, the value $P(|\alpha_m - 100| > 1)$ is actually less than 0.002.

We further note especially that if $\alpha$ is the number of occurrences of an event $A$
during $n$ independent trials, the probability of whose occurrence in a single trial
is $p$, then

$$P(|\alpha - np| > n\varepsilon) \le \frac{np(1 - p)}{n^2\varepsilon^2}$$

or, equivalently,

$$P\left(\left|\frac{\alpha}{n} - p\right| > \varepsilon\right) \le \frac{p(1 - p)}{n\varepsilon^2} \qquad (****)$$

[since it is shown on p. 31 that $\mathrm{m.v.}\,\alpha = np$ and $\mathrm{Var}\,\alpha = np(1 - p)$] for every
$\varepsilon > 0$. This implies that for every number $\varepsilon > 0$ (no matter how small!),
we can choose a number $n$ of independent trials so large that the probability

$$P\left(\left|\frac{\alpha}{n} - p\right| > \varepsilon\right)$$

of the deviation of the frequency $\alpha/n$ of the realization of the event $A$ in a series
of $n$ successive trials from the probability $p$ of the occurrence of $A$ in a single
trial by more than $\varepsilon$ becomes arbitrarily small. In fact, for any $p$ and $\varepsilon$, the
ratio $p(1 - p)/n\varepsilon^2$ appearing on the right-hand side of inequality (****) tends to
0 as $n \to \infty$, and this implies that, for sufficiently large $n$, it is arbitrarily small. But,
in real-life situations, we usually ignore events having sufficiently small probabilities,
regarding them as ‘practically impossible.’ Let us note, however, that the
decision as to how small a probability must be for the corresponding event to be
taken as impossible depends considerably on how important it is not to make a
wrong inference. Hence, the
last conclusion means that for every positive $\varepsilon$, we can find $N$ so large that the
inequality $n > N$ practically guarantees that the deviation of the frequency $\alpha/n$ from
the probability $p$ is less than $\varepsilon$. This consequence, which substantiates the
identification of the probabilities of random events with their frequencies, as set forth
at the start of this chapter, is called the law of large numbers (since it is related
to the selection of a large number $N$ of trials).
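The choice of a sufficiently large number of trials can be made explicit from the right-hand side of (****). In the sketch below, `trials_needed` is a hypothetical helper that returns the smallest $n$ for which $p(1 - p)/n\varepsilon^2$ does not exceed a prescribed small number `delta`; the tolerance `delta`, standing for what one is prepared to regard as ‘practically impossible’, is an illustrative parameter. Exact fractions are used to avoid rounding at the boundary.

```python
import math
from fractions import Fraction


def trials_needed(p, eps, delta):
    """Smallest n such that p(1 - p) / (n * eps**2) <= delta."""
    return math.ceil(p * (1 - p) / (eps ** 2 * delta))


p = Fraction(1, 2)          # p(1 - p) is largest for p = 1/2
eps = Fraction(1, 100)      # allowed deviation of the frequency from p
for delta in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    n = trials_needed(p, eps, delta)
    bound = p * (1 - p) / (n * eps ** 2)
    print(delta, n, bound)  # bound never exceeds delta
```

Since Chebyshev's inequality is crude, the numbers of trials obtained in this way are usually far larger than actually necessary.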

A similar deduction can be made also from inequality (***), which is more
general than inequality (****). Namely, it follows from (***) that for every
arbitrarily small positive number $\varepsilon$ we can always choose a sufficiently large
number $n$ of random variables $\alpha_1, \alpha_2, \ldots, \alpha_n$ (in other words, a sufficiently
large number of observations or trials), such that the
probability $P(|\alpha_m - a| > \varepsilon)$ will be sufficiently small. In fact, for every $\varepsilon$
(and every fixed value of $d$) the right-hand side $d/n\varepsilon^2$ of inequality (***) also
tends to zero as $n$ increases indefinitely. Thus, for every $\varepsilon > 0$, we can, by
means of the choice of a sufficiently large number $n$, guarantee the practical
reliability of the inequality $|\alpha_m - a| < \varepsilon$. The general statement that, for a
sufficiently large number of similar independent trials (i.e., independent trials
leading to numerical results that have the same mean value and variance), the
arithmetic mean of their results $\alpha_1, \alpha_2, \ldots, \alpha_n$ can be made as close as desired to
the mean value $a$ of the variables $\alpha_1, \alpha_2, \ldots, \alpha_n$, is also called the law of large
numbers.

In fact, we may even dispense with the requirement that the mutually independent
random variables $\alpha_1, \alpha_2, \alpha_3, \ldots$ involved in the determination of the quantity
$\alpha_m$ should have the same mean value and variance. Indeed, if the mean
values of these random variables are $a_1, a_2, a_3, \ldots$ and their variances $d_1, d_2,
d_3, \ldots$ are bounded (i.e., there is a number $D$ such that $d_i \le D$ for every $i$),
then from Chebyshev’s inequality (*) it follows that

$$P(|\alpha_m - a_m| > \varepsilon) \le \frac{D}{n\varepsilon^2}, \quad \text{where} \quad a_m = \frac{a_1 + a_2 + \cdots + a_n}{n}.$$

This implies in turn that for every number $\varepsilon > 0$ we can, by choosing a sufficiently
large number $n$, practically guarantee that the inequality $|\alpha_m - a_m| < \varepsilon$ is
satisfied. This assertion is also another form of the law of large numbers.


1.5. Algebra of events and general definition of probability 


In the earlier sections, a key role was played by two operations, each of which associates with
two events A and B a certain third event; we have designated these operations as the sum
and product of A and B, written A + B and AB (see pp. 8 and 11). Some justification for
these names is provided by the fact that the rules of ‘addition’ and ‘multiplication’ of events
strongly remind us of the rules of addition and multiplication of numbers. Thus, from the
very definition of the sum and product of two events it follows that A + B = B + A and
AB = BA; at one place we also made use of the equality (A + B)C = AC + BC (see p. 14).
The present section aims to analyse more closely the points of similarity and dissimilarity
between the ‘algebra of events’ and the ‘algebra of numbers’.

In arithmetic and algebra, we consider various sorts of numbers, e.g., integers, rational, real
(both rational and irrational) and complex numbers. In every case, with each pair of numbers
a and b, there can be associated two other numbers, their sum a + b and product ab. In
addition, the rules involving addition closely resemble the rules involving multiplication.
Thus, for instance:

a + b = b + a and ab = ba,
(a + b) + c = a + (b + c) and (ab)c = a(bc).

This analogy between the operations of addition and multiplication is also manifested in the
existence of two special (neutral) numbers 0 and 1 such that the addition of the former and the
multiplication by the latter do not alter any number:

a + 0 = a and a × 1 = a.


This analogy, unfortunately, does not go very far. The reason is the asymmetric distributive
law

(a + b)c = ac + bc,

where addition and multiplication appear in entirely different roles. Indeed, if, in the last
equation, the addition sign is replaced everywhere by the multiplication sign and vice versa,
we arrive at the absurd ‘equality’

a × b + c = (a + c) × (b + c).

Hence, many properties of addition and multiplication are very different from each other.
Thus, for instance, the number 0 plays a most critical role in relation to multiplication, as
is seen from the important equality

a × 0 = 0

(which implies, in particular, that the division of a number a, different from zero, by 0 is
impossible); in contrast to this, the following analogue of the above equality involving addition
obviously does not hold:

a + 1 = 1.


However, there also exist objects, other than numbers, for which we can define the operations
of addition and multiplication, sharing many properties of the addition and multiplication
of numbers. In some of these cases, we obtain algebraic systems in which the operations
of addition and multiplication are closer to each other than they are in the case of
numbers. Let us, for example, consider the collection of all possible
sets (or ‘figures’) of a plane. The sum A + B of two sets A and B is naturally defined as their
union (Fig. 4a). Then, we obviously have

A + B = B + A,

and

(A + B) + C = A + (B + C)

(in the last equation the union of three sets A, B and C appears on the left- and right-hand
sides, which can also be written simply as A + B + C, without parentheses). The role of zero
is played here by the so-called ‘empty’ set O, which contains no point; for such a set we have

A + O = A.

[Fig. 4. Panels (a) and (b), each showing two sets A and B.]


We now define the product AB of two sets A and B as their common part or intersection
(Fig. 4b). It is obvious, in this case, that

AB = BA,

and

(AB)C = A(BC)

(in the last equation the common part of three sets A, B and C occurs on the left- and right-
hand sides, and it is natural to denote it simply ABC). The role of unity is played here by
the entire plane I. Indeed, for every set A we have

AI = A.

It is easy to show that for an algebra of sets so defined, the distributive law

(A + B) × C = A × C + B × C

holds. To prove this law it suffices to consider Fig. 5a, where the sets A + B and C are
distinguished by two different types of shading, so that their product (intersection) (A + B)C is
shaded by double strokes; the portion I denotes the product A × C and II the product B × C. However,
we also have here the ‘second distributive law’

A × B + C = (A + C) × (B + C),

which is obtained from the first one by interchanging the roles of addition and multiplication.
For the proof of this, it suffices to consider Fig. 5b, where the sets A + C and B + C are
shaded in two different ways, so that their product (A + C) × (B + C) is hatched with double
strokes; the portion I denotes the set A × B and II the set C.

The duality between these two distributive laws makes the analogy between the
rules involving the addition and multiplication of sets complete. Thus, for instance, it is obvious here
that:

A × O = O and A + I = I.

We can compare also the two equations

A × A = A and A + A = A,

neither of which holds in the algebra of numbers.

[Fig. 5. Panels (a) and (b).]

In arithmetic and algebra, an important role is played by the comparison of numbers
according to their magnitude. If we regard ≤ to be the basic sign of comparison (the relation
a ≤ b means that the number a is not greater than the number b), then the basic rules of
comparison of numbers take the following form:

a ≤ a (every number a is not greater than itself);
if a ≤ b and b ≤ a, then a = b (if a is not greater than b and b is not greater than
a, then a and b are equal);
if a ≤ b and b ≤ c, then a ≤ c (if the number a is not greater than b and b is not
greater than c, then a is not greater than c).

We can also introduce here the comparison of sets, conventionally written as A ⊂ B (the sign
⊂ replaces the ‘composite’ sign ≤), if A is a part of the set B (A can also coincide with the
whole of B). Here, it is also obvious that†

A ⊂ A;
if A ⊂ B and B ⊂ A, then A = B;
if A ⊂ B and B ⊂ C, then A ⊂ C.

Among other set-comparison rules, the following are noteworthy:

A ⊂ A + B and AB ⊂ A,

and also

A ⊂ I and O ⊂ A.

(The last relation says that the empty set O contains no point that is not a point of the set A.
This is true for every A, since O has no point in it.)

†We note one substantial difference between number and set comparisons. For every
pair of (real) numbers a and b, one of the two relations a ≤ b and b ≤ a is necessarily valid
(even both may be valid if a and b are equal). In contrast to this, for a pair of sets A and B,
neither of the relations A ⊂ B and B ⊂ A is satisfied on many occasions. (A similar situation
holds also for complex numbers, if it is agreed, as happens in some investigations, to
write a ≤ b when the complex numbers a and b have the same argument and the modulus of
a does not exceed that of b.)

A salient difference between the algebra of sets and the algebra of numbers consists in the
concern of the former with an additional operation, which puts in correspondence with every
set $A$ a new set $\bar{A}$ (the complement of $A$). This operation is defined as follows: $\bar{A}$ consists
of all the points of the plane which do not belong to $A$. The main rules of this new operation
are

$A + \bar{A} = I$ and $A \times \bar{A} = O$;

if $A \subset B$, then $\bar{B} \subset \bar{A}$;

and, finally,

$\overline{A + B} = \bar{A} \times \bar{B}$ and $\overline{A \times B} = \bar{A} + \bar{B}$

(see Fig. 6, in which the sets $\bar{A}$ and $\bar{B}$ are shown with different types of shading, the set $\overline{A + B}$
is doubly shaded and the set $\overline{A \times B}$ has at least a single shading).

[Fig. 6.]
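The rules of the algebra of sets listed in this section can be checked mechanically on a small example. In the sketch below the ‘plane’ I is replaced by an illustrative set of four points, so that all 16 of its subsets can be enumerated; Python's `|`, `&` and `<=` on `frozenset` play the roles of the sum, the product and the relation ⊂, and the helper names `subsets` and `complement` are assumptions of this sketch.

```python
from itertools import combinations, product

I = frozenset(range(4))      # a small 'plane' of four points (illustrative)
O = frozenset()              # the empty set


def subsets(universe):
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def complement(A):
    return I - A


ALL = subsets(I)
for A, B, C in product(ALL, repeat=3):
    assert (A | B) & C == (A & C) | (B & C)    # first distributive law
    assert (A & B) | C == (A | C) & (B | C)    # second distributive law
for A, B in product(ALL, repeat=2):
    assert A <= (A | B) and (A & B) <= A
    assert complement(A | B) == complement(A) & complement(B)
    assert complement(A & B) == complement(A) | complement(B)
    if A <= B:
        assert complement(B) <= complement(A)
for A in ALL:
    assert A & A == A and A | A == A
    assert A & O == O and A | I == I
    assert A | complement(A) == I and A & complement(A) == O
print("all identities hold on", len(ALL), "sets")
```

A check on a finite example is of course no proof for sets of a plane, but it illustrates the complete symmetry between the two operations.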

There are also many other collections of objects for which it is natural to define the
concepts of sum, product, and even the ‘ordering’ $A \subset B$ as well as the ‘complement’ $\bar{A}$,
which satisfy all the algebraic properties enumerated above. One example of this class is the
collection of random events considered in Secs. 1-3; as is easy to see, all the properties of set
algebra carry over to the algebra of events. Another example can be obtained if, instead of a
set of points on a plane, we consider a set of elements of some other nature, say, a set of
integers. Suppose that by the sum and product of the sets $A$ and $B$ we understand, as before,
their union and intersection (for example, if $A_2$ and $A_3$ are the sets of numbers divisible by 2 and
by 3, respectively, then the set $A_2 + A_3$ contains all even numbers and those odd numbers
which are divisible by 3, while the set $A_2 A_3$ consists of all integers which are multiples of 6);
that we regard $A \subset B$ as holding if $A$ forms a part of $B$ (say, $A_4 \subset A_2$, where $A_4$ is the set of numbers
divisible by 4); that $\bar{A}$ is the set of all integers not belonging to $A$ (if $A$ is the set of all prime
numbers, then $\bar{A}$ contains all composite numbers and the number 1); and that $I$ and $O$ are taken,
respectively, as the set of all integers and the set that has not a single number in it. Then all the
relations enunciated above remain valid.

As one more example, we can consider the set of all divisors of a certain number $N$, which is
not divisible by any square greater than 1 (to be specific, for $N = 30$, the set of numbers 1, 2, 3,
5, 6, 10, 15, 30); if by $A + B$ and $AB$ we understand, respectively, the least common multiple
and the greatest common divisor of the numbers $A$ and $B$, by $A \subset B$ the relation that ‘$A$ is a divisor
of $B$’, and denote by $O$ and $I$ the numbers 1 and $N$ (i.e., 1 and 30) and by $\bar{A}$ the number $N/A$
(in our case $30/A$) then, as before,

$A + B = B + A$ and $AB = BA$,
$(A + B) \times C = A \times C + B \times C$ and $A \times B + C = (A + C) \times (B + C)$,
$A \subset A + B$ and $AB \subset A$,
$\overline{A + B} = \bar{A} \times \bar{B}$ and $\overline{A \times B} = \bar{A} + \bar{B}$,

and so on.
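For N = 30 these relations can be verified by direct enumeration. In the following sketch the function names `add`, `mul` and `comp` are illustrative stand-ins for the ‘sum’ (least common multiple), the ‘product’ (greatest common divisor) and the ‘complement’ N/A of the text.

```python
from itertools import product
from math import gcd

N = 30
D = [d for d in range(1, N + 1) if N % d == 0]   # all divisors of N


def add(a, b):       # 'sum': least common multiple
    return a * b // gcd(a, b)


def mul(a, b):       # 'product': greatest common divisor
    return gcd(a, b)


def comp(a):         # 'complement': N / A
    return N // a


for A, B, C in product(D, repeat=3):
    assert mul(add(A, B), C) == add(mul(A, C), mul(B, C))   # first distributive law
    assert add(mul(A, B), C) == mul(add(A, C), add(B, C))   # second distributive law
for A, B in product(D, repeat=2):
    assert add(A, B) == add(B, A) and mul(A, B) == mul(B, A)
    assert add(A, B) % A == 0        # A is a divisor of A + B
    assert A % mul(A, B) == 0        # AB is a divisor of A
    assert comp(add(A, B)) == mul(comp(A), comp(B))
    assert comp(mul(A, B)) == add(comp(A), comp(B))
for A in D:
    assert add(A, comp(A)) == N and mul(A, comp(A)) == 1    # A + A' = I, A A' = O
print(D)
```

The condition that N is not divisible by a square greater than 1 matters for the last pair of checks: for N = 12 and A = 2, for instance, the least common multiple of A and N/A is 6, not N.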


Finally, the most important example in this direction is furnished by the set of all logical
propositions (i.e., all statements of which it is meaningful to say that they are
true or false); this set is the object of study in mathematical logic. Here, by the sum $A + B$
and the product $AB$ one ought to understand the statements ‘either A or B’ and ‘both A and
B’, respectively; by $A \subset B$, the fact that the truth of $A$ implies also the truth of $B$ (for short
‘A implies B’); by $\bar{A}$, the negation of $A$ (the proposition ‘A is not true’); and by $I$ and $O$ the
propositions which are a fortiori true and false, respectively. In this case again, all the relations
described above are satisfied, and they express definite laws of logic. Thus, for example,

$A + \bar{A} = I$

is the law of the excluded middle: in all cases, the proposition $A$ is either true or false; the
relation

$A \times \bar{A} = O$

is the law of contradiction: the proposition $A$ cannot be simultaneously both true and false.
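When only the truth values of propositions are of interest, these laws can be checked by running through all cases. The sketch below works in the simplest possible setting, the two truth values `True` and `False` (playing the roles of I and O); this reduction to truth values, and the function names `add`, `mul` and `neg`, are assumptions of the sketch rather than constructions from the text.

```python
from itertools import product


def add(a, b):       # 'either A or B'
    return a or b


def mul(a, b):       # 'both A and B'
    return a and b


def neg(a):          # 'A is not true'
    return not a


I, O = True, False

for A in (False, True):
    assert add(A, neg(A)) == I        # law of the excluded middle
    assert mul(A, neg(A)) == O        # law of contradiction
for A, B, C in product((False, True), repeat=3):
    assert mul(add(A, B), C) == add(mul(A, C), mul(B, C))
    assert add(mul(A, B), C) == mul(add(A, C), add(B, C))
    assert neg(add(A, B)) == mul(neg(A), neg(B))
    assert neg(mul(A, B)) == add(neg(A), neg(B))
print("the laws hold for all truth values")
```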

The versatility and importance of algebraic systems that possess all the properties enumerated
above motivated mathematicians to study them especially. At present, such systems are
called ‘Boolean algebras’†, named after George Boole, the celebrated English mathematician
and logician of the nineteenth century, who was the first to apply such an algebra in his researches
in the field of logic.

†A Boolean algebra can be characterized as a collection of elements, where two operations
$\bar{A}$ and $A + B$ are defined (associating with every element $A$, respectively with every pair of elements $A$
and $B$, some element of the same set), having the following properties:

$A + B = B + A$,
$(A + B) + C = A + (B + C)$,
$\overline{\bar{A} + B} + \overline{\bar{A} + \bar{B}} = A$.

All the remaining properties of Boolean algebra can be derived from these three basic properties,
if we define the ‘product’ $AB$ as $\overline{\bar{A} + \bar{B}}$, the relation $A \subset B$ as the equality $A + B = B$,
and the elements $I$ and $O$ as the right-hand sides of the equations $A + \bar{A} = I$ and $A\bar{A} = O$ ($A$ being
arbitrary).

The elements of a Boolean algebra are generally not numbers. However, we often succeed
in associating with every element $A$ a number $|A|$ or $p(A)$ satisfying the following conditions:

$0 \le p(A) \le 1$; $p(O) = 0$, $p(I) = 1$;
if $A \subset B$, then $p(A) \le p(B)$;
if $A \times B = O$, then $p(A + B) = p(A) + p(B)$.

This number is called the absolute value or norm of $A$ and the Boolean algebra itself in this
case is called a normed algebra. By way of an example, we may mention the family of plane figures
belonging to a square with unit side (the square itself plays the role of the element $I$ of this
Boolean algebra), where the area of a figure $A$ is taken as the absolute value or norm of $A$. Other
examples are the set of all divisors of an integer $N$, where $N$ is not divisible by the square of any integer
greater than 1 ($N$ may be, for example, the number 30); here by the norm of $A$ is understood $\log_N A$ (in other
words, $\log_{30} A$). The collection of all propositions of mathematical logic can also be treated as a
normed Boolean algebra if it is agreed to regard the absolute value (norm) of a proposition to
be 1 if it is true, or 0 if it is false. An example of a normed Boolean algebra is also the algebra
of events, studied in Secs. 1-3; the role of absolute value or norm of an event $A$ is played here
by the probability $p(A)$ of this event.
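The conditions imposed on a norm can be checked numerically for the divisors of 30 with the norm $\log_{30} A$. In the sketch below the function name `norm` is illustrative; since logarithms are computed in floating point, the additivity condition is tested only up to a small tolerance.

```python
import math
from math import gcd

N = 30
D = [d for d in range(1, N + 1) if N % d == 0]   # elements of the algebra


def norm(a):
    """Norm of the divisor a: the logarithm of a to the base N."""
    return math.log(a) / math.log(N)


assert norm(1) == 0.0 and norm(N) == 1.0         # p(O) = 0, p(I) = 1
for A in D:
    assert 0.0 <= norm(A) <= 1.0
    for B in D:
        if B % A == 0:                           # A is a divisor of B
            assert norm(A) <= norm(B)
        if gcd(A, B) == 1:                       # 'product' of A and B equals O
            lcm = A * B // gcd(A, B)             # 'sum' of A and B
            assert math.isclose(norm(lcm), norm(A) + norm(B), abs_tol=1e-12)
print([round(norm(A), 3) for A in D])
```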

The link between probability theory and Boolean algebra can be used as the foundation
stone for the general definition of the subject matter of probability. Namely, we can assert
that the theory of probability studies a collection of objects, which form a normed Boolean
algebra; these objects are called events and the norm $p(A)$ of an event $A$ is called its probability.
Thus, for example, in the ‘urn problem’ (or in every problem reducible to it) we actually
consider the Boolean algebra of all possible sets which can be composed of $n$ given elements
(points). In addition, the sum and product of two sets (as also in all the examples below) are
defined as their union and intersection; the norm is, however, defined by the condition that for
every set consisting of a single element (i.e., an isolated point) it is equal to one and the same number $1/n$.
However, from our new viewpoint, equally legitimate are the problems that arise from
the same Boolean algebra under the more general condition that the norms of
isolated points are arbitrary positive numbers $p_1, p_2, \ldots, p_n$ which satisfy the single condition
$p_1 + p_2 + \cdots + p_n = 1$ (in particular, the problem of an imperfect die having a distorted
form or having been made of inhomogeneous material reduces to a Boolean algebra of
such type with $n = 6$). We shall encounter later also a case in which the elements of a Boolean
algebra are all possible parts of a given segment $AB$, but the norm is defined as the ratio of the
length of the part under consideration to the entire length of the segment $AB$ (see Problem 22,
Chap. 2). Quite similarly, it is sometimes useful to consider the collection of all sets belonging
to some plane figure or spatial body and define the norm as the ratio of the area or volume of
the corresponding set to the area of the entire figure or volume of the entire body (see, for
example, ‘Experiments with infinitely many possible outcomes’ on pp. 27-30 in [40]). The ‘problem
of an imperfect die’ can also be generalized to all these cases, i.e., even when considering
a Boolean algebra of all sets belonging to a given segment, figure or body, we can introduce
a norm in a completely arbitrary manner with the only requirement being that it satisfy
the conditions imposed above on the function $p(A)$. We thus arrive at a new wide class of
interesting probability-theoretic problems.
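The case of an imperfect die can be made concrete as follows. The weights below are invented for illustration (any positive numbers with sum 1 would do); the events are all 64 sets of faces, the norm `p` of an event is the sum of the weights of its points, and the sketch checks the conditions imposed above on the function p(A).

```python
from fractions import Fraction
from itertools import combinations

# Invented norms of the six isolated points (faces) of an imperfect die
weights = {1: Fraction(1, 12), 2: Fraction(1, 6), 3: Fraction(1, 6),
           4: Fraction(1, 6), 5: Fraction(1, 6), 6: Fraction(1, 4)}
assert sum(weights.values()) == 1

I = frozenset(weights)       # the certain event: some face turns up
O = frozenset()              # the impossible event


def p(event):
    """Norm (probability) of an event, i.e. of a set of faces."""
    return sum((weights[x] for x in event), Fraction(0))


events = [frozenset(c) for r in range(len(I) + 1)
          for c in combinations(sorted(I), r)]

assert p(O) == 0 and p(I) == 1
for A in events:
    assert 0 <= p(A) <= 1
    for B in events:
        if A <= B:                       # A is a part of B
            assert p(A) <= p(B)
        if not (A & B):                  # A x B = O
            assert p(A | B) == p(A) + p(B)

even = frozenset({2, 4, 6})
print(len(events), p(even))   # 64 events; probability of an even face
```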

If the statement made in the preceding paragraph (that the theory of probability studies a
normed Boolean algebra of events) is taken as the definition of probability,
then it implies that, in every problem related to this theory, the basic Boolean algebra must
necessarily be determined beforehand (i.e., it must be indicated in one way or another in the
conditions of the problem itself). The main problem of the theory of probability should then
be regarded as the determination of the probabilities of compound events formed of the
given basic or elementary events $A, B, C, D, \ldots$ by means of the operations of Boolean algebra
(for example, of the event $AB + BC + CA$ or $(A + B \times C)(A + D)$) when the probabilities
of these elementary events are considered to be known (just as the main problem in geometry
consists of the calculation of some lengths or angles in terms of other original lengths or
angles, assumed to be known; for example, the length of the hypotenuse of a right triangle
in terms of the lengths of the two legs of this triangle). In such an approach to probability
theory (indicated first by the Russian mathematician S. N. Bernstein in 1917) the crucial problem
of the methods of evaluating the basic probabilities $p(A)$, $p(B)$, and so on, obviously
remains open. However, the theory so developed will have a practical value only if these probabilities
can be determined in such a way that they coincide with the empirical frequencies
of the corresponding events in a long series of experiments. One possible way to determine the
‘basic probabilities’ satisfying this condition is given by the ‘classical definition of probability’
adduced in Sec. 1, which rests on the concept of the ‘complete system of equally probable
outcomes of an experiment’ (this ‘classical definition’ was first introduced by P. S. Laplace).
In other cases, when such a complete system does not exist, we have recourse to different routes
for determining the values of $p(A)$; for example, finding the approximate value of $p(A)$
directly by means of the repeated performance of an experiment to which the occurrence of the
event $A$ is related. The heart of the matter, however, is that the methods of determining the
original probabilities are not at all reflected in the succeeding operations over them, which
form the main content of the theory.

We also note that, in all the examples set forth above, we defined the Boolean
algebra as a collection of sets composed of the points of a ‘super set’. This circumstance is not
accidental; it is possible to show that such a formulation of this algebra is possible in all
probabilistic problems. Proceeding from this, one can reckon even from the start that the basic
object of study of probability theory is not the normed Boolean algebra of all the possible
events but a ‘set of all possible elementary events’ whose various parts (subsets) are later
identified as the ‘events’. In order to bring these arguments to their logical conclusion, it is
simply necessary to assign a well-defined norm $p(A)$ to a subset $A$ of our ‘set of elementary
events’ and prescribe the main requirements (axioms) which must be satisfied by the subsets
under consideration and their norms, so that indeed we have a normed Boolean algebra. This
approach to the axiomatic construction of probability theory (proposed by A. N. Kolmogorov
in 1929-1932) has definite advantages over the method shown above in this section for
investigating more complex and subtler questions of probability theory. Therefore, this approach
has gained the widest popularity in modern times and is now most extensively used. However,
we refrain from a deeper involvement with this topic in order not to be led far away from the
main theme of the book.


2 


Entropy and information 


2.1. Entropy as a measure of the amount of uncertainty 


The main property of random events, whose study is the basic object of this
book, is a complete lack of confidence in their occurrence, which creates the
well-known uncertainty about the outcomes of an experiment related to these events.
However, it is fully obvious that the amount of this uncertainty is different in
different cases. If our experiment consists of determining the colour of the first
raven which we will see, then, of course, we can with almost absolute confidence
consider it to be black. In fact, though ornithologists say that, in principle,
white ravens also exist, hardly anyone will entertain a doubt about the
outcome of such an experiment. Somewhat less certain is an experiment which
consists of ascertaining whether or not the first person we meet will be
left-handed; here also one can predict the result of the experiment without any
hesitation, but the risk of failing in this prediction is greater than that in the
first case. It is considerably more difficult to predict beforehand whether the
first person whose path we will cross in the street of a city will be a male or a
female. But even this experiment has relatively smaller uncertainty as compared
to, say, the one of indicating in advance who will be the winner in a tournament
with twenty participants completely unknown to us, or what will be the number
of the lottery ticket which will win first prize in a forthcoming draw. If we predict,
say, that the first person we meet in the street will be a male, we still have a
hope for the success of our conjecture, but hardly anyone will hazard a forecast
in the penultimate or much less in the last case.

For practical purposes, it is important to know how to evaluate the degree of
uncertainty of highly diverse experiments, in order that we may have an opportunity
to compare them from this aspect. To start with, let us consider the experiments
that have $k$ equally likely outcomes. It is obvious that the degree of
uncertainty of each such experiment is determined by the number $k$: if, for $k = 1$,
the outcome of an experiment is not random at all, then for $k$ large, i.e., when
a large number of different outcomes is involved, a forecast of the result of the
experiment becomes very difficult. It is thus quite clear that the desired numerical
measure of uncertainty must depend on $k$, i.e., it must be a function $f(k)$.
In addition, for $k = 1$, this function must reduce to zero (because in this case
there is no uncertainty), and it must increase with increasing $k$.

For a fuller definition of the function $f(k)$, it is necessary to impose additional
restrictions on it. We consider two independent experiments $\alpha$ and $\beta$ (i.e.,
two experiments such that the outcome of one has no effect on the probabilities of
the outcomes of the other). Suppose that $\alpha$ and $\beta$ have, respectively, $k$
and $l$ equally probable outcomes; consider also a compound experiment $\alpha\beta$
consisting of the simultaneous performance of $\alpha$ and $\beta$. It is obvious that
the uncertainty of $\alpha\beta$ is larger than that of $\alpha$, since here the
uncertainty of the outcome of $\beta$ is added to that of $\alpha$. It is, therefore,
natural to assume that the degree of uncertainty of $\alpha\beta$ equals the sum of the
uncertainties characterizing experiments $\alpha$ and $\beta$. But since the experiment
$\alpha\beta$ obviously has $kl$ outcomes of equal probability (they are obtained by
combining each of the $k$ possible outcomes of $\alpha$ with the $l$ outcomes of
$\beta$), we arrive at the following condition which must be satisfied by the
function $f(k)$:


$$f(kl) = f(k) + f(l).$$


From the last condition stems the suggestion that we take the number $\log k$
as a measure of uncertainty of an experiment that has $k$ outcomes of equal
probability (because $\log(kl) = \log k + \log l$). Such a definition of the measure
of uncertainty also agrees with the conditions that it is equal to 0 for $k = 1$ and
that it increases with increasing $k$.†

We note that the choice of a base for the system of logarithms is immaterial
here since, by virtue of the well-known formula


$$\log_b k = \log_b a \times \log_a k,$$


a transition from one system of logarithms to another reduces to only the
multiplication of the function $f(k) = \log k$ by a constant factor (the transition
factor $\log_b a$). In other words, such a transition is equivalent merely to a change
in the unit of measurement of the amount of uncertainty and is, therefore,
fundamentally a matter of indifference. In specific applications of a ‘measure of
the uncertainty’ it is customary to use logarithms to the base 2 (in other words,
to consider that $f(k) = \log_2 k$). This means that we choose here, as a unit of
the uncertainty, the uncertainty of an experiment that has two outcomes of equal
probability (say, flipping a coin to obtain a ‘head’ or ‘tail’, or finding out
the answer ‘yes’ or ‘no’ to a question for which we can expect with
equal justification the answer to be affirmative or negative). Such a unit of
measurement of uncertainty is called a binary unit (abbreviated to bit); in the


†It is easy to show that the logarithmic function is the unique function of the argument $k$
which satisfies the conditions $f(kl) = f(k) + f(l)$, $f(1) = 0$ and $f(k) > f(l)$ for $k > l$
(see Sec. 4 below).
German literature it is known also by the more expressive ‘Ja-Nein Einheit’ (yes-no
unit). Such a ‘yes-no unit’ is in a certain sense most natural; Chapter 4 will
further elaborate upon the considerations that led to its adoption in engineering.
We shall also use binary units (bits) throughout in what follows; thus, the
expression $\log k$ (where, as a rule, the base of the system of logarithms is omitted)
usually denotes $\log_2 k$ in this book. It is, however, worth noting that in the
content of this book there would be practically no change if we were to use the
more common decimal logarithms; this would only imply the choice, as a unit
for the measurement of uncertainty, of the uncertainty of an experiment that has
10 outcomes of equal probability (such, for example, is an experiment that consists
of drawing a ball from an urn with ten numbered balls, or an experiment of guessing
a digit if each of the ten digits has the same probability of being
thought of). This last unit for the measurement of uncertainty (called the
decimal unit or dit) is roughly $3\tfrac{1}{3}$ times greater than the binary unit
(since $\log_2 10 \approx 3.32 \approx 3\tfrac{1}{3}$).
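
The additivity condition and the change of unit can be checked numerically. The short Python sketch below is an illustration added for this purpose and is not part of the original exposition; the function name `uncertainty` and the sample values of k and l are hypothetical choices. It evaluates the logarithmic measure for equally likely outcomes and expresses one decimal unit in bits (about 3.32).

```python
import math

def uncertainty(k, base=2.0):
    """Uncertainty log_base(k) of an experiment with k equally likely outcomes."""
    return math.log(k) / math.log(base)

# Additivity for two independent experiments: f(kl) = f(k) + f(l).
k, l = 6, 20  # arbitrary sample values
assert math.isclose(uncertainty(k * l), uncertainty(k) + uncertainty(l))

# One decimal unit (dit) expressed in binary units (bits): log2(10).
bits_per_dit = uncertainty(10, base=2.0)
print(round(bits_per_dit, 2))
```

Changing `base` only rescales every value by the same constant factor, which is why the choice of base amounts to a choice of unit.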


The probability table for an experiment that has $k$ equally likely outcomes has
the form


Outcomes of experiment    $A_1$    $A_2$    $A_3$    ...    $A_k$
Probabilities             $1/k$    $1/k$    $1/k$    ...    $1/k$


Since we agree that the total uncertainty of such an experiment is $\log k$, it can
be considered that every individual outcome with the probability $1/k$ introduces
an uncertainty equal to $\frac{1}{k}\log k = -\frac{1}{k}\log\frac{1}{k}$. But, then, it
is natural to consider that in the case of an experiment with the probability table


Outcomes of experiment    $A_1$    $A_2$    $A_3$
Probabilities             $1/2$    $1/3$    $1/6$


the outcomes $A_1$, $A_2$ and $A_3$ introduce uncertainties which are, respectively,
equal to $-\tfrac12\log\tfrac12$, $-\tfrac13\log\tfrac13$ and $-\tfrac16\log\tfrac16$. If so,
then the total uncertainty of this experiment is given by


$$-\tfrac12\log\tfrac12 - \tfrac13\log\tfrac13 - \tfrac16\log\tfrac16 \approx 1.46 \text{ bits}.$$


Quite similarly, we can assert that in the most general case, for an experiment
$\alpha$ with the probability table


Outcomes of experiment    $A_1$       $A_2$       $A_3$       ...    $A_k$
Probabilities             $p(A_1)$    $p(A_2)$    $p(A_3)$    ...    $p(A_k)$


the measure of uncertainty is given by


$$-p(A_1)\log p(A_1) - p(A_2)\log p(A_2) - p(A_3)\log p(A_3) - \cdots - p(A_k)\log p(A_k)$$


(see also Sec. 4 of this chapter in small print). We call this last number the
entropy of an experiment $\alpha$ and denote it by $H(\alpha)$, thus following a deep
physical analogy which there is no need to go into here.†
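
As an illustration of this definition, the following minimal Python sketch computes the entropy in bits of a probability table. It is our own illustration rather than part of the original exposition: the function name `entropy` is hypothetical, logarithms are taken to base 2, and outcomes of probability zero are assumed to contribute nothing (an assumption of the sketch, justified by the limit of -p log p as p tends to 0). The three-outcome table is an arbitrary example.

```python
import math

def entropy(probabilities):
    """Entropy in bits: the sum of -p * log2(p) over all outcomes.

    Terms with p == 0 are skipped, i.e. treated as contributing zero.
    """
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1/2, 1/3, 1/6]))  # three outcomes with unequal probabilities
print(entropy([1/4] * 4))        # four equally likely outcomes, i.e. log2(4)
```

For k equally likely outcomes the function reduces to log k, in agreement with the measure adopted earlier for that special case.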

We now study the properties of the entropy $H(\alpha)$. We note in the first place
that it cannot take negative values: since we always have $0 \le p(A) \le 1$, it
follows that $\log p(A)$ cannot be positive, and hence $-p(A)\log p(A)$ cannot be
negative. We further note that, if $p$ is very small, then the product $-p\log p$ is
also quite small, even though $-\log p$ is here a large positive number. For
example, let $p = 1/2^n$; then $\log p = -n$ and $-p\log p = n/2^n$. It is clear
that the fraction $n/2^n$ for large $n$ (which corresponds to small $p = 1/2^n$) is
quite small (because with increasing $n$ the number $2^n$ grows much faster than $n$
itself; thus, for example, the number $2^{64}$ consists of 20 digits).†† Hence, it
follows that as $p \to 0$ the product $-p\log p$ tends to zero, so that


$$\lim_{p \to 0} (-p\log p) = 0$$


(cf. Figs. 7 and 9 depicting the graph of the function $y = -p\log p$; it is seen
from these graphs that when $p = 0$ the value of this function is 0). Hence, if
the probability $p(A_i)$ of the outcome $A_i$ is zero (i.e., the outcome $A_i$ is
impossible), then the corresponding term $-p(A_i)\log p(A_i)$ in the expression for
entropy can be discarded without any qualms (strictly speaking, this term makes
no sense, since $\log p(A_i)$ in this case does not exist; this is precisely why we
resort to finding the limit of the expression $-p\log p$ as $p \to 0$). Conversely,
when $p(A_i)$ is quite large (i.e., close to 1), the term $-p(A_i)\log p(A_i)$ is also


†On the relation of the concept of entropy introduced here to the thermodynamic concept of
entropy, which plays an important role in physics, see, for example, Brillouin [5] (cf. also
Poletayev [18]).

††Many readers may be aware of a legend related to this, according to which the inventor of
chess, when asked to name his reward, requested as many grains as would result from
putting one grain on the first square of the board, two on the second, and then on each
succeeding square double the number of grains on the preceding one. This reward, judged
at first by the number of squares (64), seemed quite modest; however, the corresponding
number of grains (equal to $2^{64} - 1$) actually far exceeded the entire stock of food grain on
earth.
quite small, since $\log p$ tends to zero as $p \to 1$. If the probability $p(A_i)$ is
precisely 1 (i.e., the occurrence of the outcome $A_i$ of our experiment is a certain
event), then $\log p(A_i) = 0$ and hence also $-p(A_i)\log p(A_i) = 0$ (see again Figs.
7 and 9).
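
The behaviour of a single term at the two ends of the interval can be tabulated directly. The Python sketch below is an added illustration; the helper name `term` is hypothetical, and the value 0 at p = 0 is imposed by convention in the code, following the limit discussed above.

```python
import math

def term(p):
    """Contribution -p * log2(p) of one outcome; set to 0 at p == 0 by the limit."""
    return 0.0 if p == 0 else -p * math.log2(p)

# For p = 1/2**n the term equals n / 2**n, which shrinks rapidly as n grows.
for n in (1, 5, 10, 20, 64):
    p = 1 / 2**n
    print(n, term(p), n / 2**n)

# Near p = 1 the term is small as well, and it vanishes at p = 1.
for p in (0.9, 0.99, 0.999, 1.0):
    print(p, term(p))
```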

Since $-p\log p$ is 0 if and only if $p = 0$ or $p = 1$, it is clear that the entropy
$H(\alpha)$ of an experiment $\alpha$ is 0 if and only if one of the probabilities
$p(A_1), p(A_2), \ldots, p(A_k)$ is 1 and all the others are 0 (recall that
$p(A_1) + p(A_2) + \cdots + p(A_k) = 1$; see p. 9 above). This agrees well with the
meaning of the quantity $H(\alpha)$ as a measure of uncertainty; indeed, it is only in
this case that there is no uncertainty about the outcome of an experiment.

Furthermore, it is natural to consider that, among all experiments having $k$
outcomes, an experiment $\alpha_0$ with the following probability table is most uncertain:


Outcomes of experiment    $A_1$    $A_2$    $A_3$    ...    $A_k$
Probabilities             $1/k$    $1/k$    $1/k$    ...    $1/k$


In fact, in this case it is most difficult to predict the outcome of the
experiment. This corresponds to the circumstance that the experiment $\alpha_0$ has the
largest entropy: if $\alpha$ is an arbitrary experiment with $k$ outcomes
$A_1, A_2, \ldots, A_k$, then


$$H(\alpha) = -p(A_1)\log p(A_1) - p(A_2)\log p(A_2) - \cdots - p(A_k)\log p(A_k)
\le \log k = \underbrace{-\tfrac1k\log\tfrac1k - \tfrac1k\log\tfrac1k - \cdots - \tfrac1k\log\tfrac1k}_{k \text{ times}} = H(\alpha_0),$$

where equality is valid if and only if $p(A_1) = p(A_2) = \cdots = p(A_k) = 1/k$.


We defer a complete proof of this conclusion for the present (see Appendix I
at the end of the book); here we confine ourselves to illustrating the
related theorem by an example in which $k = 2$. In this case, the theorem
reduces to proving the inequality


$$-p(A_1)\log p(A_1) - p(A_2)\log p(A_2) \le \log 2 = 1,$$


where $p(A_2) = 1 - p(A_1)$.

Fig. 7.
As already remarked, the value of the function $F(x) = -x\log x$ tends to zero
as $x \to 0$; on the other hand, for $x = 1$ its value is also zero, and for $0 < x < 1$
this function is positive (because in this case $\log x$ is negative); for $x > 1$ the
function $-x\log x$ is negative. The graph of the function under consideration
is shown in Fig. 7, where $OE = 1$, $OA = p(A_1)$, $OB = p(A_2)$ and the segments
$AM$ and $BN$ depict the quantities $-p(A_1)\log p(A_1)$ and $-p(A_2)\log p(A_2)$. Since


$$OA + OB = p(A_1) + p(A_2) = 1 = OE,$$


the distance $OS$ from the origin to the midpoint $S$ of the segment $AB$ equals
$\tfrac12$; hence in Fig. 7 the segment $SP$ equals $-\tfrac12\log\tfrac12 = \tfrac12$. But
the half-sum of the segments $AM$ and $BN$ equals the midline $SQ$ of the trapezium
$ABNM$, which does not exceed $SP$ because the curve is convex upward, so that the
chord $MN$ lies below it; consequently,


$$\tfrac12\left(-p(A_1)\log p(A_1) - p(A_2)\log p(A_2)\right) \le \tfrac12,$$


$$-p(A_1)\log p(A_1) - p(A_2)\log p(A_2) \le 1,$$


where the equality holds if and only if the segments $OA$ and $OB$ both coincide
with $OS$. Thus, it is shown that the function


$$h(p) = -p\log p - (1 - p)\log(1 - p),$$


which defines the entropy of an experiment with two outcomes (whose
probabilities are $p$ and $1 - p$), assumes its largest value (i.e., $\log 2 = 1$) when
$p = \tfrac12$. The graph of this function is given in Fig. 8, which shows how the
entropy $h(p)$ varies for $p$ varying between 0 and 1.†
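
The location of the maximum of h(p) can also be checked numerically. The following Python sketch is an added illustration, not a proof: it evaluates h on an assumed grid of 101 equally spaced points and reports where the largest value occurs. The grid size is an arbitrary choice.

```python
import math

def h(p):
    """Binary entropy -p*log2(p) - (1-p)*log2(1-p), with 0*log 0 taken as 0."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

grid = [i / 100 for i in range(101)]  # p = 0.00, 0.01, ..., 1.00
best = max(grid, key=h)
print(best, h(best))

# The function is symmetric about p = 1/2 and vanishes at both ends.
print(h(0.3), h(0.7), h(0.0), h(1.0))
```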

In the case of an experiment with $k$ possible outcomes, the entropy is given by
the formula


$$H(p_1, p_2, \ldots, p_k) = -p_1\log p_1 - p_2\log p_2 - \cdots - p_k\log p_k,$$


where $p_1, p_2, \ldots, p_k$ are the probabilities of individual outcomes, so that we
always have $p_1 + p_2 + \cdots + p_k = 1$. This is a generalisation of the case
considered above (because, when $k = 2$, the function $H(p_1, p_2, \ldots, p_k)$ turns
into $H(p_1, 1 - p_1) = h(p_1)$); it can also be shown that the function
$H(p_1, p_2, \ldots, p_k)$ assumes its largest value, namely $\log k$, when
$p_1 = p_2 = \cdots = p_k = 1/k$; for the proof, see Appendix I. In order to bring out
the nature of the relationship between the function $H(p_1, p_2, \ldots, p_k)$ and the
individual probabilities $p_1, p_2, \ldots, p_k$, we consider again a graph of the
function $-p\log p$, $0 \le p \le 1$ (see Figure 9, where a part of Figure 7 is
reproduced on a somewhat larger scale).


Fig. 9. Graph of the function $-p\log p$ (abscissa: $p$, from 0 to 1.0).


From this graph it is seen that, when $p < 0.1$, the quantity $-p\log p$ grows
extraordinarily fast; hence in this range a comparatively small decrease in the
probability $p_i$ results in a highly significant decrease of the corresponding term
$-p_i\log p_i$ in the expression of the function $H(p_1, p_2, \ldots, p_k)$. This leads us
to the fact that the summands $-p_i\log p_i$ which correspond to very small


†Tables of values of the functions $-p\log p$ and $h(p) = -p\log p - (1 - p)\log(1 - p)$
(logarithms are binary ones) are given in Appendices III and IV of this book.
values of the probability $p_i$ contribute very little to the expression of
$H(p_1, p_2, \ldots, p_k)$ in comparison to other terms. Therefore, in calculating the
entropy all the low-probability outcomes can often be disregarded without risk of
any significant error (cf. the text in small print on p. 59). Conversely, in the range
between $p = 0.2$ and $p = 0.6$, where $-p\log p$ assumes its greatest values, it
changes comparatively evenly; hence in this range even a fairly significant
variation in the probabilities $p_i$ has a comparatively small effect on the value of
the entropy. We also note that from the continuity of the graph of the function
$-p\log p$ it follows that the entropy $H(\alpha)$ depends continuously upon the
probabilities of individual outcomes of an experiment $\alpha$, i.e., that, for very small
variations of these probabilities, the entropy also varies very little.


Problem 16. There are two urns, each containing 20 balls, there being 10 white,
5 black and 5 red balls in the first and 8 white, 8 black and 4 red in the second.
One ball is drawn from each urn. The outcome of which of these two
experiments should be regarded as more uncertain?

The probability tables for the corresponding experiments (we denote them by
$\alpha_1$ and $\alpha_2$) have the form:


(i) Experiment $\alpha_1$ (draw of a ball from the first urn):


Colour of the ball    white    black    red
Probabilities         $1/2$    $1/4$    $1/4$


(ii) Experiment $\alpha_2$ (draw of a ball from the second urn):


Colour of the ball    white    black    red
Probabilities         $2/5$    $2/5$    $1/5$


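The entropy of the first experiment is given by


$$H(\alpha_1) = -\tfrac12\log\tfrac12 - \tfrac14\log\tfrac14 - \tfrac14\log\tfrac14
= \tfrac12 \times 1 + \tfrac12 \times 2 = 1.5 \text{ bits},$$


but the entropy of the second is somewhat greater, given by


$$H(\alpha_2) = -\tfrac25\log\tfrac25 - \tfrac25\log\tfrac25 - \tfrac15\log\tfrac15
\approx \tfrac45 \times 1.32 + \tfrac15 \times 2.32 \approx 1.52 \text{ bits}.$$
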
Hence, if we evaluate (as we agreed to do) the amount of
uncertainty of the outcome of an experiment by its entropy, then we have to
regard the outcome of the second experiment as more uncertain than that of
the first.
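
The arithmetic of Problem 16 can be reproduced with a few lines of Python. The sketch below is an added illustration; the function name `entropy_from_counts` is hypothetical, and the probabilities are obtained from the ball counts stated in the problem.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of drawing one ball from an urn with the given colour counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

first_urn = [10, 5, 5]   # white, black, red
second_urn = [8, 8, 4]   # white, black, red

print(round(entropy_from_counts(first_urn), 2))
print(round(entropy_from_counts(second_urn), 2))
```

The difference between the two values is small, which reflects how close the second urn's composition is to the first in terms of predictability.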


Problem 17. Suppose it is known from several years of weather observation
that for a certain locality the probability that 15 June will be or will not be a rainy
day equals 0.4 or 0.6, respectively. Further, assume that for the same locality the
probability that on 15 November there will be rain equals 0.65, that there will be
snowfall equals 0.15, and that on this day there will be no precipitation
equals 0.2. If, of all the weather characteristics, the question of the presence
and nature of precipitation alone is of interest, then on which of the two
days enumerated above should the weather be regarded as more uncertain in
the locality under consideration?

According to what is understood here by the term ‘weather’, experiments $\alpha_1$
and $\alpha_2$, which consist of determining the weather that will prevail on 15 June and
15 November, are characterized by the following probability tables:


(i) Experiment $\alpha_1$:


Outcomes of experiment    Rain    Absence of precipitation
Probabilities             0.4     0.6


(ii) Experiment $\alpha_2$:


Outcomes of experiment    Rain    Snowfall    Absence of precipitation
Probabilities             0.65    0.15        0.2


Hence, the entropies of our two experiments are given by


$$H(\alpha_1) = -0.4\log 0.4 - 0.6\log 0.6 \approx 0.97 \text{ bits}$$


and


$$H(\alpha_2) = -0.65\log 0.65 - 0.15\log 0.15 - 0.2\log 0.2 \approx 1.28 \text{ bits} > H(\alpha_1).$$


Consequently, in the locality in question, the weather should be considered 
to be more uncertain on 15 November than on 15 June. 

The result obtained obviously depends substantially on how the term ‘weather’
is interpreted; without making precisely explicit what is implied by it, our
problem in general has no meaning. In particular, if we are only interested in
whether there will or will not be precipitation on a given day, the two
outcomes ‘rain’ and ‘snowfall’ of experiment $\alpha_2$ ought to be combined. In this
case, instead of $\alpha_2$, we have the experiment $\alpha_2'$ whose entropy is given by


$$H(\alpha_2') = -0.8\log 0.8 - 0.2\log 0.2 \approx 0.72 \text{ bits} < H(\alpha_1).$$


Hence, with such an interpretation of weather, it is necessary to regard the
weather as less uncertain on 15 November than on 15 June. If, however, not
only the precipitation but, say, the atmospheric temperature is also of concern,
then the solution of the problem becomes more complicated and demands
additional data on the temperature distribution in the given locality
on 15 June and 15 November.
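
The dependence of the answer on how the outcomes are grouped can be verified numerically. The Python sketch below is an added illustration with hypothetical variable names; it uses the probabilities stated in Problem 17, for which direct evaluation gives approximately 0.97 bits for 15 June, 1.28 bits for 15 November with rain and snowfall distinguished, and 0.72 bits for 15 November with the two combined.

```python
import math

def entropy(probabilities):
    """Entropy in bits of a table of outcome probabilities."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

june = [0.4, 0.6]                     # rain, no precipitation
november = [0.65, 0.15, 0.2]          # rain, snowfall, no precipitation
november_merged = [0.65 + 0.15, 0.2]  # any precipitation, no precipitation

for name, table in [("15 June", june),
                    ("15 November", november),
                    ("15 November, precipitation combined", november_merged)]:
    print(name, round(entropy(table), 2))
```

Combining outcomes can only lower the entropy, since it discards the uncertainty about which of the combined outcomes occurred.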

The arguments developed in the solution of Problem 17 are of interest for
estimating the quality of weather prediction by some method (the same
holds for every other kind of forecast). In fact, in estimating the quality of
prediction, one should not take note of its accuracy alone (i.e., the percentage of
cases in which the forecast is fulfilled); otherwise, it would lead to an
over-estimation of every forecast that has a great chance of being found correct
(e.g., of a forecast that there will be no snow in Moscow on 1 June, which is
obviously of no importance). For a comparison of the quality of different forecasts,
we ought to take note not only of their accuracy but also of the difficulty of making
a good forecast, which can be characterized by the amount of uncertainty in the
corresponding experiment. We shall again turn to this question later (see
Problem 21 in Sec. 3 of this chapter).


Historically, the first steps in the formulation of the concept of entropy were
taken as early as 1928 by Hartley†, the American communication engineer. He
proposed characterizing the amount of uncertainty of an experiment with $k$
different outcomes by the number $\log k$. He was, of course, well aware that
this measure of uncertainty is quite convenient only in some practical problems,
while in many cases it will be quite futile (and even elusive). This is due to the
fact that it completely ignores the distinctions among the natures of the
occurring outcomes (a most improbable outcome is given here the same importance as
a highly likely outcome). However, he held that the distinctions among probable
and unlikely outcomes are determined in the first place by ‘psychological
factors’ and, therefore, should be taken into account only by psychologists and
not be considered by communication engineers and mathematicians.

The fallibility of Hartley’s viewpoint was shown by C. Shannon in 1948. He
introduced the quantity


$$H(\alpha) = -p(A_1)\log p(A_1) - p(A_2)\log p(A_2) - \cdots - p(A_k)\log p(A_k)$$


†R. V. L. Hartley (1928). Transmission of information, Bell System Tech. J. 7(3), 535-63.
as a measure of uncertainty of an experiment $\alpha$ with possible outcomes
$A_1, A_2, \ldots, A_k$, where $p(A_1), p(A_2), \ldots, p(A_k)$ are the probabilities of
individual outcomes; he also named this quantity ‘entropy’. In other words, Shannon
assigns the uncertainty $-\log p(A_i)$ to an outcome $A_i$ of the experiment $\alpha$ (in
the case of $k$ equally likely outcomes with probability $p = 1/k$, this leads back to
Hartley’s old suggestion to take the number $\log k = -\log p$ as a measure of
uncertainty). Furthermore, as a measure of the uncertainty of the entire experiment
$\alpha$, we take the mean value of the uncertainty of individual outcomes, i.e., the
mean value of a random variable taking the values
$-\log p(A_1), -\log p(A_2), \ldots, -\log p(A_k)$ with probabilities
$p(A_1), p(A_2), \ldots, p(A_k)$ [by the definition derived on p. 6 this mean value
is precisely equal to $H(\alpha)$]. Thus, the perplexing ‘psychological factors’
introduced by Hartley are here taken into account by using the concept of probability,
which has a purely mathematical (or more accurately, a statistical) character.

The use of the quantity $H(\alpha)$ as a measure of uncertainty of the experiment
$\alpha$ is found very convenient for a large variety of purposes; in the following, our
main objective will be to make this situation transparent. It should, however,
be borne in mind that the Shannon measure, like Hartley’s measure, cannot
claim to take into account all the factors determining the ‘uncertainty of an
experiment’ in every sense in which it may be encountered in real life. Thus,
for example, $H(\alpha)$ depends only on the probabilities $p(A_1), p(A_2), \ldots, p(A_k)$ of
the various outcomes of the experiment but in no way depends on what these
outcomes are, whether they are in a certain sense ‘close’ to or quite ‘remote’ from
each other. Hence, our ‘amount of uncertainty’ will be the same for two random
variables characterized by the following probability tables:




Values 0.9 1 1.1 Values — 200 1 1000 


and 


I 1 1 
Probabilities a a Rg Probabilities > 


or for two methods of treatment of a patient, of which one results in complete
recovery in 90 out of 100 cases and appreciable improvement in the condition
of the patient in the remaining 10 cases, while the second also achieves complete
success in 90 out of 100 cases but in the remaining 10 cases ends
in death. The vital distinction between the two experiments in these
cases has to be evaluated by other characteristics, different from Shannon’s
entropy.
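
That the entropy ignores the values taken by a random variable can be seen in a small computation. In the Python sketch below, which is an added illustration, the two sets of values (0.9, 1, 1.1 and -200, 1, 1000) are those of the tables above, while the common probabilities 1/4, 1/2, 1/4 are an assumption made purely for the sake of the example. The variance is used here as one possible ‘other characteristic’; the text does not single it out.

```python
import math

def entropy(probabilities):
    """Entropy in bits; depends on the probabilities only."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def variance(values, probabilities):
    """Variance of a random variable; depends on the values as well."""
    mean = sum(v * p for v, p in zip(values, probabilities))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probabilities))

probs = [0.25, 0.5, 0.25]              # assumed for illustration only
close_values = [0.9, 1.0, 1.1]
remote_values = [-200.0, 1.0, 1000.0]

print(entropy(probs))                  # the same for both random variables
print(variance(close_values, probs))
print(variance(remote_values, probs))
```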

The peculiarity of the entropy $H(\alpha)$ indicated, as also a series of other
peculiarities of this quantity, stem naturally from the fact that the concept of entropy
first arose in attempts to solve certain problems intimately related
to the transmission of information through communication lines, and hence it is
very suitable for precisely such uses. The point is that for determining the
time required for the transmission of some communication or the cost of such
transmission, the specific content of the communication itself is altogether
immaterial. This is manifested in the entropy $H(\alpha)$ being independent of the values
$A_1, A_2, \ldots, A_k$ of the outcomes of an experiment. On the other hand, the
probabilities of individual communications are of great importance for
communication theory; we shall further elaborate on this in Chapter 4. Another
point of great importance is that in the operation of communication lines a
crucial role is played by statistical regularities, since through these lines there is
always transmitted a large amount of information of various kinds. Hence, the
measure of uncertainty used in the solution of problems related to communication
engineering must first of all be adapted to the evaluation of the amount of
uncertainty of intricate ‘compound experiments’ consisting of a long series of trials
following one after another.

Let us also note that, from the viewpoint of an investigator studying the amount
of uncertainty of such compound experiments, the difference between the
treatments due to Shannon and Hartley is not as striking as it might appear
at the start. In fact, even if we look from Hartley’s standpoint, it is impossible
to ignore completely the probabilities of the occurrence of outcomes; otherwise
we could arbitrarily increase the number $k$ of outcomes of our experiment by
adding to the really possible outcomes any number of fictitious outcomes of
probability zero. Hence, in the calculation of the measure of uncertainty of an
experiment according to Hartley, we should certainly reject all ‘impossible’
outcomes of zero probability. However, in addition, it is hardly worthwhile to take
account of ‘practically impossible’ outcomes having negligibly small probability
of occurrence. We now replace the experiment $\alpha$ with $k$ distinct outcomes by
another experiment $\alpha_N$ made up of $N$ repetitions (under identical
conditions) of $\alpha$. The number of distinct outcomes of $\alpha_N$ is $k^N$; these $k^N$
outcomes are obtained by combining the $k$ possible outcomes of the first
performance of $\alpha$ with the $k$ possible outcomes of the second performance, ..., and
the $k$ outcomes of the $N$th performance of $\alpha$. Hence, the amount of uncertainty of
experiment $\alpha_N$ by Hartley’s measure is $\log k^N = N\log k$, which again leads to
the expression $\log k$ for the amount of uncertainty of $\alpha$ (because it is natural to
consider that the amount of uncertainty of an experiment which consists of
$N$ repetitions of $\alpha$ must be $N$ times greater than the amount of uncertainty
of $\alpha$; cf. a similar argument on p. 45).

However, so far we have said nothing about the probabilities of our $k^N$
outcomes of the experiment $\alpha_N$. It is plain that if the $k$ outcomes of $\alpha$ are
equally likely, then all $k^N$ outcomes of experiment $\alpha_N$ are equally likely also,
since here none of these $k^N$ events is distinguished by anything from the rest. If,
however, the $k$ outcomes of $\alpha$ have different probabilities
$p(A_1), p(A_2), \ldots, p(A_k)$, then the $k^N = 2^{N\log k}$ outcomes of the compound
experiment $\alpha_N$ also have different probabilities. It is found that, for large values
of $N$, most of these $2^{N\log k}$ outcomes will have such negligible probability that
even the sum of the probabilities of all such low-probability outcomes is very small.
As regards the remaining (more probable) outcomes of the experiment $\alpha_N$, the
probabilities of all these outcomes for large $N$ are almost indistinguishable from
each other. Speaking more precisely, it can be shown that for sufficiently large $N$
we can always discard some (as a rule, quite large!) portion of the outcomes of the
experiment $\alpha_N$, so that the total probability of all the excluded outcomes is less
than any small number chosen beforehand (say, less than 0.01, or 0.001, or
0.000001; the only requirement here is that the smaller we choose this number to
be, the greater the number $N$ should be) and all the remaining outcomes of the
experiment $\alpha_N$ have practically the same probability. It is highly important in this
case that the number of outcomes of the experiment $\alpha_N$ left over after such
rejection is found to be of order $2^{N H(\alpha)}$, where
$H(\alpha) = -p(A_1)\log p(A_1) - \cdots - p(A_k)\log p(A_k)$ is the entropy of
experiment $\alpha$.† Hence, it is clear that even from Hartley’s viewpoint it is natural
to take the number $\log 2^{N H(\alpha)} = N \times H(\alpha)$ as a measure of the uncertainty
of the experiment $\alpha_N$ (because the outcomes whose probability sum is negligible are
naturally discarded); in addition, for the amount of uncertainty of the initial
experiment $\alpha$, we obtain the value $N \times H(\alpha)/N = H(\alpha)$. Thus, it is seen that
the treatment of Shannon differs from that of Hartley primarily in building up a long
chain of repeated realizations of one and the same experiment $\alpha$; the consideration
of such a chain is typical of a probabilistic (statistical) approach.
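
The counting statement can be explored numerically in the simplest case. The Python sketch below is an added illustration under stated assumptions, not a proof: it takes an experiment with two outcomes of probabilities 0.1 and 0.9 (an arbitrary choice), repeats it N times, keeps the most probable sequences until their total probability reaches 0.99, and reports the logarithm of the number of kept sequences divided by N. The function names are hypothetical, and sequences are kept in whole groups having the same number of rare outcomes, which slightly overcounts.

```python
import math

def binary_entropy(p):
    """Entropy in bits of an experiment with outcome probabilities p and 1 - p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def kept_outcomes_rate(p, n, eps):
    """log2(number of kept sequences) / n for n repetitions.

    For p < 1/2 a sequence is more probable the fewer rare outcomes it
    contains, so groups are kept in order of increasing count j until the
    kept probability reaches 1 - eps.
    """
    if not 0 < p < 0.5:
        raise ValueError("this sketch assumes 0 < p < 1/2")
    kept_count = 0
    kept_probability = 0.0
    for j in range(n + 1):
        group_size = math.comb(n, j)  # sequences with exactly j rare outcomes
        log_probability = (math.log(group_size)
                           + j * math.log(p) + (n - j) * math.log(1 - p))
        kept_count += group_size
        kept_probability += math.exp(log_probability)
        if kept_probability >= 1 - eps:
            break
    return math.log2(kept_count) / n

p, eps = 0.1, 0.01
print("H =", binary_entropy(p), " log k =", 1.0)
for n in (100, 1000, 4000):
    print(n, kept_outcomes_rate(p, n, eps))
```

In this example the printed rate stays well below log k = 1 and moves toward H as N increases, although the approach is slow.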

The statement, set above in italic letters, brings out the statistical meaning of 
the concept of entropy; it lies at the very root of most of the engineering applica- 
tions of this concept. However, a proof of this statement is not quite straight- 
forward; we defer it (and also a somewhat more precise formulation of the state- 
ment itself) to the last section, which is directly devoted to the applications of 
the concept of entropy to the theory of transmission of information. 


The real value of the concept of entropy stems primarily from the fact that the 
‘amount of uncertainty’ of an experiment expressed by it is found in many cases 
to be that particular characteristic, which has a role to play in diverse processes 
of the transmission and storage of various types of information encountered in 
nature and engineering. Later, we shall give a more elaborate exposition of 
some engineering applications of the concept of entropy; here, however, we shall 
present only one example of an entirely different variety. 

One of the basic problems dealt with in experimental psychology is a study of 
psychic reactions, i.e., the response of an organism to some stimulation or action. 


†Hence it follows, in particular, that if the outcomes of the experiment $\alpha$ are not all equally
likely and, consequently, $H(\alpha) < \log k$, then the excluded outcomes form the dominating
part of all the outcomes of $\alpha_N$ (because the ratio
$2^{N H(\alpha)} : k^N = 2^{N H(\alpha)} : 2^{N\log k} = 2^{-N(\log k - H(\alpha))}$ is quite small for large $N$).
In addition, these reactions are classified into simple reactions, in which a definite
response is given to a single assigned signal, and complex reactions, the most important
of which is the choice reaction, in which different signals evoke different responses.
It is known that the time necessary for a simple reaction of a person (i.e., the
time interval between the stimulus and the reaction) usually does not depend on
the nature of the stimulating signal (for adults, its minimum value is close
to 0.1 sec). A considerably more complicated problem is to ascertain the time
necessary for a complex reaction. This depends substantially on the conditions
of the experiment and primarily on the ‘amount of complexity’ of the reaction.
As early as the 1880’s psychologists had established that the average rate at which
a person can react to a sequence of random consecutive signals of $k$ different
kinds (provided that to each kind of signal he must react differently) decreases
monotonically with increasing $k$. In order to verify this fact, a large number of
experiments were carried out to determine the average time necessary for a choice
reaction, and these almost always yielded roughly the same result. The most
usual setting of such experiments is a board before the subject on which one of
$k$ lights is flashed or one of $k$ digits appears at definite intervals of time, and
depending on which signal appears he is to press one of $k$
buttons on which his fingers were placed beforehand or utter one of $k$ preassigned
words. A special device records the time elapsing between the appearance
of the signal and the reaction of the subject to it; the dependence of the mean
reaction time $T$ obtained on the number $k$ is then studied.

It is natural to consider the mean reaction time as a definite ‘measure of un- 
certainty’ of the expected signal : the greater the uncertainty in the occurrence 
of the event, the greater is the time required to ascertain precisely what signal 
was delivered. The existing experimental data show that the mean reaction time 
increases with the increase of the number k of different kinds of signals roughly 
as log k, i.e., as the Shannon entropy H(«) of an experiment «, consisting of send- 
ing the signals (in all the experiments with which we are concerned here, the 
probability of signals of different kinds is always the same). For example, in 
Fig. 10 (taken from R. Hyman [47]) the circles show the data of eight
experiments which were carried out to determine the average time required by the
subject to indicate which of $k$ lights was flashed. The number $k$ ranged in these
experiments from 1 to 8. The mean reaction time was determined from a large
number of series of flashes, in each of which the flash frequency of all lights was
identical, and the subject was already specially trained in similar experiments.
In Fig. 10, the ordinate gives the mean reaction time and the abscissa the
quantity $\log k$; as is seen, all eight circles lie quite closely
along a single straight line.

On the basis of these data, it is natural to surmise that the mean reaction time
in all cases is determined by the entropy of an experiment α consisting of sending
the signals. This, in turn, implies that the decrease in the degree of uncertainty
of an experiment caused by replacing equally probable signals by not equally
probable signals must produce the same reduction of the mean reaction time
as occurs when the number of different kinds of signals used is decreased so as
to give the same reduction in the entropy H(α). This statement admits direct
experimental verification, which substantiates it completely. Thus, in the same
Fig. 10, the squares plot the results of eight other experiments (carried out with
the same subject as earlier) in which k lights (with k equal to 2, 4, 6 or 8) were
flashed with different relative frequencies p(A_1), p(A_2), ..., p(A_k), and the
subject was trained beforehand for some time on a series of flashes at such
frequencies. Here again, the mean reaction time T is plotted on the ordinate and
the entropy H(α) = -p(A_1) log p(A_1) - p(A_2) log p(A_2) - ... - p(A_k) log p(A_k)
on the abscissa; it is found that the squares are arranged quite accurately along
the same straight line along which the circles lie. It is thus seen that the entropy
H(α) is indeed precisely that measure of the uncertainty of the outcome of an
experiment which determines in a decisive manner the mean time required for
a specific reaction to the advent of a signal.

Fig. 10. Mean reaction time (in units of 0.001 sec) plotted against entropy (in bits).
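The entropy values plotted on the abscissa of Fig. 10 can be computed directly from the signal frequencies. The following is a minimal sketch, not taken from the experiments themselves: the function name `entropy` and the skewed frequency table are illustrative assumptions. It shows that k equally probable signals give log k bits, while unequal frequencies for the same k give less.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a finite probability table."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# k equally probable signals: the entropy equals log2(k).
for k in (1, 2, 4, 8):
    uniform = [1 / k] * k
    print(k, round(entropy(uniform), 3), round(math.log2(k), 3))

# Hypothetical unequal frequencies for k = 4 lights (illustrative only).
skewed = [0.7, 0.1, 0.1, 0.1]
print(round(entropy(skewed), 3))
```

Terms with zero probability are skipped, in line with the convention that -x log x tends to 0 as x tends to 0.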

The reason for the variation in the mean reaction time with variation in the
relative frequencies of different signals lies in the fact that the subject reacts
more rapidly to more frequently appearing (i.e., more familiar to him) signals
but, on the other hand, more slowly to infrequent signals which he does not
expect. Obviously, these factors are psychological in character. Nevertheless,
it is seen that they can also be characterized quantitatively by the value of the
entropy H(α) of an experiment α, despite Hartley's misgivings that no
'psychological factors' (which, however, according to his understanding had a
considerably more vague relation with psychology than in the present example)
can be quantitatively estimated.


At the end of this section we present some data illustrating how little numerous
low-probability outcomes contribute to the entropy of an experiment with many
outcomes.

We consider an experiment in which we are to select at random an English word
consisting of four letters from a printed text. We can use here the data contained in
the well-known 'Thorndike dictionary' [167], which catalogues the frequencies of the
20,000 most common English words, obtained by a statistical analysis of quite voluminous
and varied English texts. This dictionary contains altogether 1550 four-letter words;
accordingly, we can consider that our experiment α has 1550 different outcomes. We now
calculate the entropy

$$H(\alpha) = -p(A_1)\log p(A_1) - p(A_2)\log p(A_2) - \dots - p(A_{1550})\log p(A_{1550})$$

of this experiment, taking the probability p(A_i) of each outcome to be equal to the
frequency n_i/N of the corresponding word; here n_i is the number of repetitions of this
word catalogued in Thorndike's dictionary, and N = n_1 + n_2 + ... + n_1550. It is found
that this entropy is close to 8.14 bits†. We shall now discard all words for which
n_i < 150; by doing so, there remain only 865 four-letter words, i.e., slightly more than
50 per cent of the original number (to be precise, 55.8 per cent). At the same time, the
part of the sum H(α) corresponding to these 865 words is equal to roughly 8 bits, i.e.,
forms more than 98 per cent of the entire quantity H(α). We now reject all words for
which n_i < 750; by doing so, we are left with 395 words, i.e., in all about one-fourth
(25.5 per cent) of the original number. However, to these 395 words there corresponds a
part of the sum H(α) greater than 7.47 bits, i.e., constituting over 92 per cent of the
entire quantity H(α). If we next exclude all words with n_i < 1550, then we are left
with only 214 words (13.8 per cent of the original number); however, to these 214
outcomes there corresponds a part of the sum H(α) close to 6.88 bits, i.e., comprising
about 85 per cent of its original value. Finally, if we discard all words with
n_i < 3150, then altogether 119 four-letter words are left (7.7 per cent of the original
number); however, to this 7.7 per cent of outcomes there corresponds roughly 78 per cent
of the sum H(α) (this part of the sum H(α) exceeds 6.44 bits).


2.2. The entropy of compound events. Conditional entropy 


Let α and β be two independent experiments with the following probability tables:

(i) Experiment α:

Outcomes of experiment    A_1      A_2      ...    A_k
Probabilities             p(A_1)   p(A_2)   ...    p(A_k)

(ii) Experiment β:

Outcomes of experiment    B_1      B_2      ...    B_l
Probabilities             p(B_1)   p(B_2)   ...    p(B_l)

†This value, like all the accompanying numerical data, is taken from [19].


Let us now consider the compound experiment αβ, consisting of the simultaneous
occurrence of experiments α and β. This experiment can have kl outcomes:

A_1B_1, A_1B_2, ..., A_1B_l; A_2B_1, A_2B_2, ..., A_2B_l; ...; A_kB_1, A_kB_2, ..., A_kB_l,

where, say, A_1B_1 means that the experiment α has the outcome A_1 and β the
outcome B_1. It is obvious that the uncertainty of the experiment αβ is greater than
the uncertainty of either of the experiments α and β, because here both these
experiments are realized jointly and each of them can have different outcomes
as the case may be.

We shall prove the equality

$$H(\alpha\beta) = H(\alpha) + H(\beta)$$

(the addition law of entropies), which is in good agreement with the meaning of
entropy as a measure of the amount of uncertainty. By the definition of H(αβ),
we have

$$
\begin{aligned}
H(\alpha\beta) = {} & -p(A_1B_1)\log p(A_1B_1) - p(A_1B_2)\log p(A_1B_2) - \dots - p(A_1B_l)\log p(A_1B_l) \\
& - p(A_2B_1)\log p(A_2B_1) - p(A_2B_2)\log p(A_2B_2) - \dots - p(A_2B_l)\log p(A_2B_l) \\
& - \dots \\
& - p(A_kB_1)\log p(A_kB_1) - p(A_kB_2)\log p(A_kB_2) - \dots - p(A_kB_l)\log p(A_kB_l).
\end{aligned}
$$

But since the experiments α and β are independent, p(A_1B_1) = p(A_1) p(B_1),
p(A_1B_2) = p(A_1) p(B_2), and so on (see Sec. 2, Chap. 1). Hence, the first l members
of the right-hand side can be written as

$$
\begin{aligned}
& -p(A_1)p(B_1)\bigl(\log p(A_1) + \log p(B_1)\bigr) - p(A_1)p(B_2)\bigl(\log p(A_1) + \log p(B_2)\bigr) \\
& \qquad - \dots - p(A_1)p(B_l)\bigl(\log p(A_1) + \log p(B_l)\bigr) \\
& = -p(A_1)\bigl(p(B_1) + p(B_2) + \dots + p(B_l)\bigr)\log p(A_1) \\
& \qquad + p(A_1)\bigl(-p(B_1)\log p(B_1) - p(B_2)\log p(B_2) - \dots - p(B_l)\log p(B_l)\bigr) \\
& = -p(A_1)\log p(A_1) + p(A_1)H(\beta)
\end{aligned}
$$

(since p(B_1) + p(B_2) + ... + p(B_l) = 1). Likewise the 2nd, ..., kth groups
of l members in the expression for H(αβ) are given by

$$-p(A_2)\log p(A_2) + p(A_2)H(\beta), \quad \dots, \quad -p(A_k)\log p(A_k) + p(A_k)H(\beta)$$

and, hence,

$$
\begin{aligned}
H(\alpha\beta) & = -p(A_1)\log p(A_1) - p(A_2)\log p(A_2) - \dots - p(A_k)\log p(A_k) \\
& \qquad + \bigl(p(A_1) + p(A_2) + \dots + p(A_k)\bigr)H(\beta) = H(\alpha) + H(\beta)
\end{aligned}
$$

(since also p(A_1) + p(A_2) + ... + p(A_k) = 1).
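The addition law for independent experiments can be checked numerically. The sketch below assumes two small hypothetical probability tables (the numbers are illustrative and not from the text); it builds the kl probabilities p(A_i B_j) = p(A_i) p(B_j) and compares the entropy of the compound experiment with the sum of the two separate entropies.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of a finite probability table."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def compound_independent(p_alpha, p_beta):
    """Probabilities p(A_i B_j) = p(A_i) p(B_j) for independent experiments."""
    return [pa * pb for pa, pb in product(p_alpha, p_beta)]

p_alpha = [0.5, 0.3, 0.2]   # hypothetical table for alpha (k = 3)
p_beta = [0.6, 0.4]         # hypothetical table for beta (l = 2)

joint = compound_independent(p_alpha, p_beta)
# The two printed numbers agree up to rounding error.
print(entropy(joint), entropy(p_alpha) + entropy(p_beta))
```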


We now assume that the experiments α and β are not independent (for example,
that α and β are the successive draws of two balls from one urn; see p. 20). In
this more general case, we cannot expect the entropy of the compound experiment
αβ to be the sum of the entropies of α and β. In fact, a case is conceivable here
in which the result of the second experiment is completely determined by the
result of the first (for example, if the experiments α and β consist of the
successive draws of two balls from an urn containing in all two balls of
different colours). Then, after the realization of the experiment α, the experiment β
completely loses its uncertainty; hence, here it is natural to assume that the
entropy (the measure of the amount of uncertainty) of the compound experiment
αβ equals the entropy of the single experiment α and not the sum of the entropies
of α and β (in the following, we shall be convinced that it is indeed so). We
shall now try to find an expression for the entropy of αβ in the general case.

We repeat the derivation of the formula for the entropy H(αβ) of αβ, this time
without supposing that α and β are independent. Obviously, as before, we have

$$
\begin{aligned}
H(\alpha\beta) = {} & -p(A_1B_1)\log p(A_1B_1) - p(A_1B_2)\log p(A_1B_2) - \dots - p(A_1B_l)\log p(A_1B_l) \\
& - p(A_2B_1)\log p(A_2B_1) - p(A_2B_2)\log p(A_2B_2) - \dots - p(A_2B_l)\log p(A_2B_l) \\
& - \dots \\
& - p(A_kB_1)\log p(A_kB_1) - p(A_kB_2)\log p(A_kB_2) - \dots - p(A_kB_l)\log p(A_kB_l),
\end{aligned}
$$

where by A_1, A_2, ..., A_k and B_1, B_2, ..., B_l we again denote, respectively, the
outcomes of α and β. However, here it is impossible to replace the probabilities
p(A_1B_1), p(A_1B_2), ... simply by the products of the corresponding probabilities.
In fact, p(A_1B_1) is now equal not to p(A_1) p(B_1) but to p(A_1) p_{A_1}(B_1),
where p_{A_1}(B_1) is the conditional probability of the event B_1 given A_1 (see Sec. 3,
Chap. 1). This circumstance plays an essential part in the reasoning that follows.
As before, we start with the first l terms appearing in the expression
for H(αβ). Obviously, we can rewrite them as

$$
\begin{aligned}
& -p(A_1)p_{A_1}(B_1)\bigl(\log p(A_1) + \log p_{A_1}(B_1)\bigr) - p(A_1)p_{A_1}(B_2)\bigl(\log p(A_1) + \log p_{A_1}(B_2)\bigr) \\
& \qquad - \dots - p(A_1)p_{A_1}(B_l)\bigl(\log p(A_1) + \log p_{A_1}(B_l)\bigr) \\
& = -p(A_1)\bigl(p_{A_1}(B_1) + p_{A_1}(B_2) + \dots + p_{A_1}(B_l)\bigr)\log p(A_1) \\
& \qquad + p(A_1)\bigl(-p_{A_1}(B_1)\log p_{A_1}(B_1) - p_{A_1}(B_2)\log p_{A_1}(B_2) - \dots - p_{A_1}(B_l)\log p_{A_1}(B_l)\bigr).
\end{aligned}
$$

But

$$p_{A_1}(B_1) + p_{A_1}(B_2) + \dots + p_{A_1}(B_l) = p_{A_1}(B_1 + B_2 + \dots + B_l) = 1,$$

because B_1 + B_2 + ... + B_l is the certain event (one of the outcomes B_1, B_2, ..., B_l
of the experiment β is sure to occur). On the other hand, the sum

$$-p_{A_1}(B_1)\log p_{A_1}(B_1) - p_{A_1}(B_2)\log p_{A_1}(B_2) - \dots - p_{A_1}(B_l)\log p_{A_1}(B_l)$$

is the entropy of the experiment β under the condition that the event A_1 occurs
(the entropy of the experiment β depends on the outcome of the experiment α,
because the probabilities of the outcomes of β depend on the outcome of α that
has occurred). This expression is naturally called the conditional entropy of the
experiment β given A_1 and is denoted by H_{A_1}(β).

Thus, the first l terms of the expression for H(αβ) can be rewritten in the form

$$-p(A_1)\log p(A_1) + p(A_1)H_{A_1}(\beta).$$

In exactly the same manner, the 2nd, ..., kth groups of l terms of this expression
are, respectively, given by

$$-p(A_2)\log p(A_2) + p(A_2)H_{A_2}(\beta), \quad \dots, \quad -p(A_k)\log p(A_k) + p(A_k)H_{A_k}(\beta),$$

where H_{A_2}(β), ..., H_{A_k}(β) are the conditional entropies of the experiment β given
A_2, ..., A_k. This yields the formula

$$
\begin{aligned}
H(\alpha\beta) & = -p(A_1)\log p(A_1) - p(A_2)\log p(A_2) - \dots - p(A_k)\log p(A_k) \\
& \qquad + p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) + \dots + p(A_k)H_{A_k}(\beta) \\
& = H(\alpha) + \bigl\{p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) + \dots + p(A_k)H_{A_k}(\beta)\bigr\}.
\end{aligned}
$$


The first member of the last expression is the entropy of the experiment α.
As to the second member (in the braces), this is the mean value of a random
variable assuming, with probabilities p(A_1), p(A_2), ..., p(A_k), the values H_{A_1}(β),
H_{A_2}(β), ..., H_{A_k}(β), i.e., the values equal to the conditional entropy of β under
the condition that the experiment α has the outcomes A_1, A_2, ..., A_k. This mean
value is called the mean conditional entropy of β under the assumption that α
occurred or, in short, the conditional entropy of β given that α occurred; we denote
this by H_α(β):

$$H_\alpha(\beta) = p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) + \dots + p(A_k)H_{A_k}(\beta).$$

Thus, we finally have

$$H(\alpha\beta) = H(\alpha) + H_\alpha(\beta).$$

This is the general rule for determining the entropy of a compound experiment αβ.
It can also be called the addition law of entropies, similar to the law
derived above for the particular case in which the experiments α and β are
independent.
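The general addition law H(αβ) = H(α) + H_α(β) can likewise be checked on a table of joint probabilities. The sketch below is illustrative only: the joint table is a hypothetical example, and the helper names are assumptions introduced here. It also treats the urn with two balls of different colours, where the second draw is fully determined by the first.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a finite probability table."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal_alpha(joint):
    """p(A_i) from a table joint[i][j] = p(A_i B_j)."""
    return [sum(row) for row in joint]

def mean_conditional_entropy(joint):
    """H_alpha(beta) = sum over i of p(A_i) * H_{A_i}(beta)."""
    total = 0.0
    for row in joint:
        p_a = sum(row)
        if p_a > 0:
            conditional = [p / p_a for p in row]   # p_{A_i}(B_j)
            total += p_a * entropy(conditional)
    return total

def compound_entropy(joint):
    """H(alpha beta) computed directly from all k*l joint probabilities."""
    return entropy([p for row in joint for p in row])

# Hypothetical dependent experiments (rows A_1, A_2; columns B_1, B_2, B_3).
joint = [[0.30, 0.10, 0.10],
         [0.05, 0.25, 0.20]]
lhs = compound_entropy(joint)
rhs = entropy(marginal_alpha(joint)) + mean_conditional_entropy(joint)
print(lhs, rhs)

# Urn with one black and one white ball: the second draw is determined by the first.
two_balls = [[0.0, 0.5],
             [0.5, 0.0]]
print(compound_entropy(two_balls), mean_conditional_entropy(two_balls))
```

In the two-ball case the mean conditional entropy vanishes, so the entropy of the compound experiment reduces to that of the first draw.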

It ought to be remarked particularly that the quantity H_α(β), the mean conditional
entropy, plays a vital role in the problems treated in this book. The point
is that as soon as we know which specific outcome A_i of the experiment α has
occurred, then in the subsequent determination of the conditional entropy
H_{A_i}(β) of the experiment β we can completely ignore all the lines of the
conditional probability table

$$
\begin{array}{llll}
p_{A_1}(B_1), & p_{A_1}(B_2), & \dots, & p_{A_1}(B_l), \\
p_{A_2}(B_1), & p_{A_2}(B_2), & \dots, & p_{A_2}(B_l), \\
\dots & & & \\
p_{A_k}(B_1), & p_{A_k}(B_2), & \dots, & p_{A_k}(B_l),
\end{array}
$$

except the ith line corresponding to the outcome A_i. Hence, the conditional
entropy H_{A_i}(β) does not at all depend on how the probabilities of the individual
outcomes of β vary when the experiment α has one of the other k - 1 outcomes
(out of the total number k of outcomes). Consequently, it characterizes only very
incompletely the relation between the experiments α and β, a complete expression
for which is given by the full conditional probability table†.
In contrast to this, the mean conditional entropy H_α(β) reflects the
interdependence between α and β far more fully. We shall further elaborate this
in Sec. 3 of this chapter.

†We note that a knowledge of this table (and of the probability tables of the experiments α
and β) enables us to calculate also the conditional probabilities of the outcomes A_1, A_2, ..., A_k
of experiment α under the assumption that the experiment β had some definite outcome B_1, or
B_2, ..., or B_l (in this connection, see p. 22).

We indicate some important properties of the quantity H_α(β). Obviously, this
is a non-negative number. It is clear that, if all the probabilities p(A_1), p(A_2), ...,
p(A_k) are different from zero, i.e., if the experiment α has indeed k outcomes
occurring with positive probabilities, then H_α(β) = 0 if and only if H_{A_1}(β) =
H_{A_2}(β) = ... = H_{A_k}(β) = 0, i.e., if and only if for every outcome of the
experiment α the result of the experiment β is completely determined (trivially, this
condition is satisfied if the experiment β is not indeterminate from the very
outset). In such a case, we have

$$H(\alpha\beta) = H(\alpha)$$

(see p. 61). If, however, the experiments α and β are independent, then

$$H_{A_1}(\beta) = H_{A_2}(\beta) = \dots = H_{A_k}(\beta) = H(\beta)$$

and, hence,

$$H_\alpha(\beta) = H(\beta).$$

In this case, the formula H(αβ) = H(α) + H_α(β) carries over into the simpler one:
H(αβ) = H(α) + H(β) (see p. 60).

It is quite essential that in all cases the conditional entropy H_α(β) lies between
0 and the (unconditional) entropy H(β) of β, which is sometimes called the marginal
entropy of β:

$$0 \le H_\alpha(\beta) \le H(\beta),$$

implying that the conditional entropy can never be greater than the unconditional
one. Thus, the two cases, namely, when an outcome of the experiment β is
completely determined by an outcome of α and when α and β are independent,
are the two extreme cases.

The statement that 0 ≤ H_α(β) ≤ H(β) is also in good agreement with the
interpretation of entropy as a measure of uncertainty: it is completely obvious
that the previous realization of the experiment α can only decrease the amount
of uncertainty of β or, in the extreme case (say, in the case when α and β are
independent), leave this amount of uncertainty unchanged, but in no case can it
increase it†.

†To avoid possible misunderstanding, we note that the conditional entropy H_{A_1}(β) can be
both smaller and greater than the unconditional entropy H(β) (see, for example, Problems 18
and 19 below). This is related to the fact that the change in the probability table of the
experiment β brought about by the circumstance that the other experiment α had a definite
outcome A_1 can be quite arbitrary (see pp. 21-22).

A complete proof of the statement made (including also a proof of the fact
that H_α(β) = H(β) if and only if the experiments α and β are independent) is
given in Appendix I at the end of the book; here we shall only demonstrate it
by an example in which the experiment α has two equally probable outcomes,
A_1 and A_2. In this case

$$H_\alpha(\beta) = p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) = \tfrac12 H_{A_1}(\beta) + \tfrac12 H_{A_2}(\beta).$$

Thus, our problem reduces to establishing that the inequality

$$\tfrac12 H_{A_1}(\beta) + \tfrac12 H_{A_2}(\beta) \le H(\beta)$$

holds. In other words, it is required to show that

$$
\begin{aligned}
& \tfrac12\bigl[-p_{A_1}(B_1)\log p_{A_1}(B_1) - p_{A_1}(B_2)\log p_{A_1}(B_2) - \dots - p_{A_1}(B_l)\log p_{A_1}(B_l)\bigr] \\
& \quad + \tfrac12\bigl[-p_{A_2}(B_1)\log p_{A_2}(B_1) - p_{A_2}(B_2)\log p_{A_2}(B_2) - \dots - p_{A_2}(B_l)\log p_{A_2}(B_l)\bigr] \\
& \le -p(B_1)\log p(B_1) - p(B_2)\log p(B_2) - \dots - p(B_l)\log p(B_l),
\end{aligned}
$$

where B_1, B_2, ..., B_l are the outcomes of the experiment β. We again consider the
graph of the function F(x) = -x log x, and suppose that

$$OA = p_{A_1}(B_1), \qquad OB = p_{A_2}(B_1)$$

in Fig. 11. Then the segments AM and BN are of lengths -p_{A_1}(B_1) log p_{A_1}(B_1)
and -p_{A_2}(B_1) log p_{A_2}(B_1), respectively. The sum
-½ p_{A_1}(B_1) log p_{A_1}(B_1) - ½ p_{A_2}(B_1) log p_{A_2}(B_1) is equal to the middle
line SQ of the trapezium ABNM. On the other hand, the segment SP ≥ SQ is equal to
-p(B_1) log p(B_1), since

$$OS = \tfrac12 OA + \tfrac12 OB = p(A_1)p_{A_1}(B_1) + p(A_2)p_{A_2}(B_1) = p(B_1)$$

(see the equation of total probability on p. 23). Consequently,

$$-\tfrac12 p_{A_1}(B_1)\log p_{A_1}(B_1) - \tfrac12 p_{A_2}(B_1)\log p_{A_2}(B_1) \le -p(B_1)\log p(B_1).$$

Fig. 11.

In a similar way, the following inequalities are proved:

$$
\begin{aligned}
-\tfrac12 p_{A_1}(B_2)\log p_{A_1}(B_2) - \tfrac12 p_{A_2}(B_2)\log p_{A_2}(B_2) & \le -p(B_2)\log p(B_2), \\
& \;\;\vdots \\
-\tfrac12 p_{A_1}(B_l)\log p_{A_1}(B_l) - \tfrac12 p_{A_2}(B_l)\log p_{A_2}(B_l) & \le -p(B_l)\log p(B_l).
\end{aligned}
$$

Adding all these inequalities, we arrive at the desired result.
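The argument rests on the graph of F(x) = -x log x being convex upward, so that the middle line of the trapezium never exceeds the ordinate of the curve at the midpoint. This can be illustrated numerically as follows; the randomly generated conditional tables, the seed and the helper names are assumptions made for this sketch, and a numerical check of this kind is of course not a proof.

```python
import math
import random

def F(x):
    """F(x) = -x log2 x, with F(0) taken to be 0."""
    return -x * math.log2(x) if x > 0 else 0.0

def random_table(l, rng):
    """A random probability table with l outcomes."""
    weights = [rng.random() for _ in range(l)]
    total = sum(weights)
    return [w / total for w in weights]

def inequalities_hold(row1, row2, tol=1e-12):
    """row1[j] = p_{A_1}(B_j), row2[j] = p_{A_2}(B_j), with p(A_1) = p(A_2) = 1/2."""
    marginal = [(a + b) / 2 for a, b in zip(row1, row2)]       # p(B_j)
    termwise = all((F(a) + F(b)) / 2 <= F(m) + tol
                   for a, b, m in zip(row1, row2, marginal))
    h_cond = 0.5 * sum(F(a) for a in row1) + 0.5 * sum(F(b) for b in row2)
    h_beta = sum(F(m) for m in marginal)
    return termwise and h_cond <= h_beta + tol

def check_random_tables(trials=1000, l=4, seed=0):
    rng = random.Random(seed)   # fixed seed, chosen arbitrarily
    return all(inequalities_hold(random_table(l, rng), random_table(l, rng))
               for _ in range(trials))

print(check_random_tables())
```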
We also note that since the compound experiments αβ and βα do not differ from
each other, H(αβ) = H(βα), i.e.,

$$H(\alpha) + H_\alpha(\beta) = H(\beta) + H_\beta(\alpha).$$

This implies in particular that by knowing the entropies H(α) and H(β) of α
and β and the conditional entropy H_α(β) of β, given that α occurred, we can
determine the conditional entropy H_β(α) of α given that β is realized:

$$H_\beta(\alpha) = H_\alpha(\beta) + \{H(\alpha) - H(\beta)\}.$$

Since 0 ≤ H_β(α) ≤ H(α), the formula H_α(β) = H_β(α) + H(β) - H(α)
implies that

$$H(\beta) - H(\alpha) \le H_\alpha(\beta) \le H(\beta);$$

when H(β) > H(α), this estimate of the value of the conditional entropy H_α(β)
is found to be more precise than the one derived on p. 64. The equality

$$H_\alpha(\beta) = H(\beta) - H(\alpha)$$

holds when H_β(α) = 0, i.e., if an outcome of β completely determines an outcome
of α. In such a case, we always have H(β) ≥ H(α) (which, obviously, also
agrees nicely with the meaning of the term 'uncertainty of an experiment').


Problem 18. It is known that 2 out of 100 persons on the average suffer from
a certain disease. For diagnosis, a particular reaction is used which is always
positive for the case in which a person is sick; if, however, a person is healthy,
then it is as often positive as negative. Suppose that an experiment β consists of
determining whether a person is sick or healthy, and an experiment α of determining
the result of the reaction indicated. The question is: What is the entropy
H(β) of β and the conditional entropy H_α(β) of β given that α is realized?

Here, obviously, the two outcomes of the experiment β, the outcome B_1 (a
person is healthy) and the outcome B_2 (a person is sick), have the probabilities
p(B_1) = 0.98 and p(B_2) = 0.02. Hence,

$$H(\beta) = -0.98 \log 0.98 - 0.02 \log 0.02 \approx 0.14 \text{ bits}.$$

The experiment α also has two outcomes: A_1 (positive reaction) and A_2
(negative reaction). The probabilities of these two outcomes are given by

$$p(A_1) = 0.51 \quad\text{and}\quad p(A_2) = 0.49.$$

(This is because the outcome A_1 occurs in one-half of those cases in which β
has the outcome B_1 and in all cases in which β has the outcome B_2, but the
outcome A_2 is realized only in one-half of the cases in which β has the outcome
B_1.) Moreover, if α had the outcome A_1, then the conditional probabilities of the
outcomes of β are given by

$$p_{A_1}(B_1) = \frac{49}{51} \quad\text{and}\quad p_{A_1}(B_2) = \frac{2}{51}$$

(because out of 51 cases in which the reaction is positive, a person is found
healthy in 49 cases, and sick in two cases); hence the conditional entropy H_{A_1}(β)
will be appreciably greater than the unconditional entropy H(β):

$$H_{A_1}(\beta) = -\frac{49}{51}\log\frac{49}{51} - \frac{2}{51}\log\frac{2}{51} \approx 0.24 \text{ bits}.$$

On the other hand, if the experiment α has the outcome A_2, then we can state
with certainty that the experiment β had the outcome B_1 (the person is healthy);
consequently,

$$H_{A_2}(\beta) = 0.$$

Thus, the mean conditional entropy of β given that α is realized is less than
the unconditional entropy H(β):

$$H_\alpha(\beta) = 0.51 \times H_{A_1}(\beta) + 0.49 \times H_{A_2}(\beta) = 0.51 \times 0.24 \approx 0.12 \text{ bits}.$$

In other words, the realization of α decreases the amount of uncertainty of β
by roughly 0.02 bits.
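The arithmetic of Problem 18 can be reproduced in a few lines. This is only a sketch of the calculation above; the variable names are chosen here for illustration, and logarithms are taken to base 2 so that the results are in bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a finite probability table."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_sick = 0.02                  # p(B_2)
p_healthy = 1 - p_sick         # p(B_1)
p_pos_given_sick = 1.0         # the reaction is always positive for a sick person
p_pos_given_healthy = 0.5      # positive as often as negative for a healthy person

# Probabilities of the outcomes of alpha (total probability).
p_pos = p_healthy * p_pos_given_healthy + p_sick * p_pos_given_sick   # p(A_1)
p_neg = 1 - p_pos                                                     # p(A_2)

h_beta = entropy([p_healthy, p_sick])

# Conditional tables p_{A_1}(B_j) and p_{A_2}(B_j).
given_pos = [p_healthy * p_pos_given_healthy / p_pos,
             p_sick * p_pos_given_sick / p_pos]
given_neg = [p_healthy * (1 - p_pos_given_healthy) / p_neg,
             p_sick * (1 - p_pos_given_sick) / p_neg]

h_given_pos = entropy(given_pos)   # H_{A_1}(beta)
h_given_neg = entropy(given_neg)   # H_{A_2}(beta)
h_mean = p_pos * h_given_pos + p_neg * h_given_neg   # H_alpha(beta)

print(round(h_beta, 2), round(h_given_pos, 2), round(h_mean, 2))
```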


Problem 19. Suppose that the experiments α and β consist of drawing successively
two balls from an urn containing m black and n - m white balls (α is the
draw of the first ball, β the draw of the second ball). Determine the entropies
H(α) and H(β) and the conditional entropies H_β(α) and H_α(β) of α and β,
respectively. Solve this problem also subject to the condition that experiment α
consists of drawing k balls from the urn and experiment β is the succeeding draw
of one more ball.

We start with the case when α consists of the draw of one ball. We suppose
that A_1 and A_2 (resp. B_1 and B_2) represent the appearance of a black and a
white ball in the first (resp. second) draw. When nothing is known about the
outcomes of either the first or the second experiment, we can expect the realization
of these events with the following probabilities:

(i) Experiment α:

Outcomes of experiment    A_1      A_2
Probabilities             m/n      (n - m)/n

(ii) Experiment β:

Outcomes of experiment    B_1      B_2
Probabilities             m/n      (n - m)/n

Thus, both these experiments have the same entropy:

$$H(\alpha) = H(\beta) = -\frac{m}{n}\log\frac{m}{n} - \frac{n-m}{n}\log\frac{n-m}{n}.$$

If the outcome of the experiment α is known to us, then the probabilities
of the individual outcomes of the experiment β will have different values. To
be exact (see above, p. 20):

$$
\begin{aligned}
p_{A_1}(B_1) & = \frac{m-1}{n-1}, & p_{A_1}(B_2) & = \frac{n-m}{n-1}, \\
p_{A_2}(B_1) & = \frac{m}{n-1}, & p_{A_2}(B_2) & = \frac{n-m-1}{n-1}.
\end{aligned}
$$

Hence it follows that

$$
\begin{aligned}
H_{A_1}(\beta) & = -\frac{m-1}{n-1}\log\frac{m-1}{n-1} - \frac{n-m}{n-1}\log\frac{n-m}{n-1}, \\
H_{A_2}(\beta) & = -\frac{m}{n-1}\log\frac{m}{n-1} - \frac{n-m-1}{n-1}\log\frac{n-m-1}{n-1}.
\end{aligned}
$$

Further, if m < n - m, then

$$H_{A_1}(\beta) < H(\beta), \qquad H_{A_2}(\beta) > H(\beta)$$

(because the uncertainty of an experiment consisting of the draw of a ball from
an urn with m black and m_1 = n - m white balls increases as the ratio m/m_1
approaches unity). Finally, we have

$$
\begin{aligned}
H_\alpha(\beta) & = p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) \\
& = \frac{m}{n}\left[-\frac{m-1}{n-1}\log\frac{m-1}{n-1} - \frac{n-m}{n-1}\log\frac{n-m}{n-1}\right] \\
& \quad + \frac{n-m}{n}\left[-\frac{m}{n-1}\log\frac{m}{n-1} - \frac{n-m-1}{n-1}\log\frac{n-m-1}{n-1}\right]
\end{aligned}
$$

(in all cases H_α(β) < H(β)) and

$$H_\beta(\alpha) = H_\alpha(\beta) + \{H(\alpha) - H(\beta)\} = H_\alpha(\beta).$$
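For concrete values of n and m the quantities of Problem 19 (first case) are easy to tabulate. The sketch below is an illustration under stated assumptions: the function name `urn_entropies` and the choice n = 10, m = 3 are hypothetical, and the code assumes 1 ≤ m ≤ n - 1 and n ≥ 2.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a finite probability table."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def urn_entropies(n, m):
    """Two successive draws from an urn with m black and n - m white balls.

    Returns H(beta), H_{A_1}(beta), H_{A_2}(beta) and H_alpha(beta) in bits.
    """
    p_black, p_white = m / n, (n - m) / n                    # p(A_1), p(A_2)
    h_beta = entropy([p_black, p_white])                     # equals H(alpha)
    h_a1 = entropy([(m - 1) / (n - 1), (n - m) / (n - 1)])   # black drawn first
    h_a2 = entropy([m / (n - 1), (n - m - 1) / (n - 1)])     # white drawn first
    h_mean = p_black * h_a1 + p_white * h_a2
    return h_beta, h_a1, h_a2, h_mean

# Hypothetical urn: 3 black and 7 white balls, so that m < n - m.
h_beta, h_a1, h_a2, h_mean = urn_entropies(10, 3)
print(h_beta, h_a1, h_a2, h_mean)
```

With m < n - m, the values come out in the order H_{A_1}(β) < H_α(β) < H(β) < H_{A_2}(β) for this urn.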
We now pass on to the more general problem set forth in the second part of the
statement. We denote by α_k an experiment α consisting of the draw of k balls from
the urn, and assume that k does not exceed the numbers m and n - m. In such
a case, α_k can have k + 1 different outcomes corresponding to the fact that
among the balls drawn there are 0, 1, 2, ..., k black balls; we denote these
outcomes by A_0, A_1, A_2, ..., A_k. The probability p(A_i) of the outcome A_i
(i = 0, 1, ..., k) equals the ratio

$$p(A_i) = \binom{m}{i}\binom{n-m}{k-i} \Big/ \binom{n}{k}.$$

In fact, the total number of equally probable outcomes of the experiment α_k
equals \binom{n}{k} (the number of all possible groups of k balls which can be composed
of the available n balls), and of them \binom{m}{i}\binom{n-m}{k-i} outcomes are favourable
to the outcome A_i (since out of the m available black balls i black balls can be
selected in \binom{m}{i} ways, and the remaining k - i white balls in \binom{n-m}{k-i} ways).
This implies that the entropy of α_k is

$$
H(\alpha_k) = -\frac{\binom{n-m}{k}}{\binom{n}{k}}\log\frac{\binom{n-m}{k}}{\binom{n}{k}}
- \frac{\binom{m}{1}\binom{n-m}{k-1}}{\binom{n}{k}}\log\frac{\binom{m}{1}\binom{n-m}{k-1}}{\binom{n}{k}}
- \dots - \frac{\binom{m}{k}}{\binom{n}{k}}\log\frac{\binom{m}{k}}{\binom{n}{k}}.
$$

The experiment β has two outcomes, B_1 (the draw of a black ball) and B_2 (the
draw of a white ball). The probabilities of these two outcomes are, respectively,
equal to m/n and (n - m)/n. The entropy of β, as before, is

$$H(\beta) = -\frac{m}{n}\log\frac{m}{n} - \frac{n-m}{n}\log\frac{n-m}{n}.$$

Now suppose it to be known that the outcome A_i of the experiment α_k has
occurred. This means that, after the realization of this experiment, m - i black
and n - m - k + i white balls are left in the urn. In conformity with this,

$$p_{A_i}(B_1) = \frac{m-i}{n-k}, \qquad p_{A_i}(B_2) = \frac{n-m-k+i}{n-k},$$

and

$$H_{A_i}(\beta) = -\frac{m-i}{n-k}\log\frac{m-i}{n-k} - \frac{n-m-k+i}{n-k}\log\frac{n-m-k+i}{n-k}.$$

To calculate H_{α_k}(β) it remains only to make use of the formula

$$
H_{\alpha_k}(\beta) = \frac{\binom{n-m}{k}}{\binom{n}{k}}H_{A_0}(\beta)
+ \frac{\binom{m}{1}\binom{n-m}{k-1}}{\binom{n}{k}}H_{A_1}(\beta)
+ \dots + \frac{\binom{m}{k}}{\binom{n}{k}}H_{A_k}(\beta).
$$

Finally, the conditional entropy H_β(α_k) is defined by

$$H_\beta(\alpha_k) = H_{\alpha_k}(\beta) + H(\alpha_k) - H(\beta).$$

The case when k is greater than either the number m or n - m or even both
can be treated similarly. We shall not analyze here all the possibilities open to
us but confine ourselves only to a few observations.
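The formulas for the draw of k balls lend themselves to direct computation. The following sketch is an illustration only: the function name `urn_k_draws` and the sample urn are assumptions, and the code requires 1 ≤ k < n. It relies on Python's `math.comb` (available from Python 3.8), which returns 0 when the lower index exceeds the upper one, so outcomes that cannot occur simply receive probability 0.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a finite probability table."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def urn_k_draws(n, m, k):
    """Urn with m black and n - m white balls; alpha_k draws k balls, beta one more.

    Returns H(alpha_k), H(beta), H_{alpha_k}(beta), H_beta(alpha_k) in bits.
    """
    total = math.comb(n, k)
    probs = []
    h_mean = 0.0
    for i in range(k + 1):                      # i black balls among the k drawn
        p_i = math.comb(m, i) * math.comb(n - m, k - i) / total
        probs.append(p_i)
        if p_i > 0:
            black_left = m - i
            white_left = n - m - k + i
            h_mean += p_i * entropy([black_left / (n - k), white_left / (n - k)])
    h_alpha_k = entropy(probs)
    h_beta = entropy([m / n, (n - m) / n])
    h_beta_alpha = h_mean + h_alpha_k - h_beta  # H_beta(alpha_k)
    return h_alpha_k, h_beta, h_mean, h_beta_alpha

# Hypothetical urn with n = 10 balls, m = 3 of them black.
for k in (1, 2, 3, 8, 9):
    print(k, [round(x, 4) for x in urn_k_draws(10, 3, k)])
```

For k = n - 1 the mean conditional entropy comes out as 0, since the remaining ball is then known.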

(a) Suppose that k = n - 1. The experiment α_{n-1} has, in all, two outcomes
A_1 and A_2 corresponding to the case in which the last ball remaining in the urn
is black (white). The probabilities of these two outcomes equal m/n and (n - m)/n,
because the choice of the n - 1 balls to be drawn is equivalent to that of the single
remaining ball and, consequently, our experiment α_{n-1} does not substantially
differ from the experiment α_1 consisting of the draw of exactly one ball from an
urn with n balls. Thus, the entropy of α_{n-1} is

$$H(\alpha_{n-1}) = -\frac{m}{n}\log\frac{m}{n} - \frac{n-m}{n}\log\frac{n-m}{n},$$

i.e., it coincides with the entropy of β. As to the conditional entropy H_{α_{n-1}}(β),
it is obviously 0, because the outcome of α_{n-1} completely predetermines the
outcome of β. By analogous reasoning, the conditional entropy H_β(α_{n-1}) is also 0.

(b) Suppose that k = n - 2. The experiment α_{n-2} has three outcomes, A_0,
A_1 and A_2, corresponding to the case in which there remain in the urn either
two black balls, or a black and a white ball, or two white balls (we assume here
that neither of the numbers m and n - m is less than 2). The probabilities of
these outcomes are given by

$$
p(A_0) = \frac{\binom{m}{2}}{\binom{n}{2}} = \frac{m(m-1)}{n(n-1)}, \qquad
p(A_1) = \frac{m(n-m)}{\binom{n}{2}} = \frac{2m(n-m)}{n(n-1)}, \qquad
p(A_2) = \frac{\binom{n-m}{2}}{\binom{n}{2}} = \frac{(n-m)(n-m-1)}{n(n-1)}.
$$

In agreement with this, the entropy of α_{n-2} is

$$
\begin{aligned}
H(\alpha_{n-2}) = {} & -\frac{m(m-1)}{n(n-1)}\log\frac{m(m-1)}{n(n-1)} - \frac{2m(n-m)}{n(n-1)}\log\frac{2m(n-m)}{n(n-1)} \\
& - \frac{(n-m)(n-m-1)}{n(n-1)}\log\frac{(n-m)(n-m-1)}{n(n-1)}.
\end{aligned}
$$

The conditional entropy of the experiment β given the realization of a definite
outcome of α_{n-2} is given by†

$$H_{A_0}(\beta) = 0, \qquad H_{A_1}(\beta) = 1, \qquad H_{A_2}(\beta) = 0,$$

but the conditional entropy of β given the realization of α_{n-2} is

$$H_{\alpha_{n-2}}(\beta) = \frac{2m(n-m)}{n(n-1)}.$$

Finally, the conditional entropy of α_{n-2} given the realization of β is

$$H_\beta(\alpha_{n-2}) = H_{\alpha_{n-2}}(\beta) + H(\alpha_{n-2}) - H(\beta).$$


†Here H_{A_1}(β) ≥ H(β), since an experiment β with two outcomes cannot have an entropy
exceeding 1 bit.

(c) If m = 1, then the experiment α_k has just two outcomes A_1 and A_2,
corresponding to the cases in which the single black ball is found among the k balls
drawn or among the n - k balls remaining in the urn; the probabilities of these
outcomes are given by

$$p(A_1) = \frac{k}{n}, \qquad p(A_2) = \frac{n-k}{n}.$$

The conditional entropy of the experiment β given that the outcome A_1 of the
experiment α_k has occurred is 0:

$$H_{A_1}(\beta) = 0$$

(because obviously the outcome A_1 of α_k uniquely determines the outcome of β).
The conditional entropy of β given that the outcome A_2 of α_k has occurred is

$$H_{A_2}(\beta) = -\frac{1}{n-k}\log\frac{1}{n-k} - \frac{n-k-1}{n-k}\log\frac{n-k-1}{n-k};$$

it exceeds the (unconditional) entropy of the same experiment β, which is given by

$$H(\beta) = -\frac{1}{n}\log\frac{1}{n} - \frac{n-1}{n}\log\frac{n-1}{n}.$$

(In fact, if among the balls contained in the urn only one ball differs in colour
from the rest, then the amount of uncertainty of the experiment consisting of
the draw of one ball is the smaller, the larger the total number of balls.)
However, the mean conditional entropy of β,

$$H_{\alpha_k}(\beta) = \frac{n-k}{n}H_{A_2}(\beta),$$

is less than the (unconditional) entropy H(β).


If the pair of experiments α and β is carried out many times one after the
other, then the conditional entropy H_α(β) characterizes the mean amount of
uncertainty of the outcome of β which remains after the outcome of the experiment
α preceding it is known. In particular, in an experiment on determining the
average reaction time (see p. 56 and onwards) a whole series of signals is always
sent and the subject knows prior to each of them what signals have previously
been given to him. Hence, the amount of uncertainty of the signal to be sent
here equals the conditional entropy of the corresponding experiment given that
the outcomes of all previous experiments (i.e., previous sendings of signals)
are known. In the experiments described on pp. 56-59, the successively sent
signals were always selected independently of each other; hence, in these
experiments, the conditional entropy of α coincided with its unconditional entropy
H(α). If, however, the reaction time is actually determined by the amount of
uncertainty of the signal to be sent, measured by its entropy, then from what
has been stated above it necessarily follows that a variation in the amount of
uncertainty produced by introducing a dependence among successively sent signals
must have the same influence on the mean reaction time as a variation in the
amount of uncertainty due to a change in the total number of equally probable
signals used or due to an alteration in the relative frequencies of these signals.
The results of the verification of this conclusion are depicted in Fig. 12, taken
again from [47]. In this figure are plotted the 8 circles and 8 squares encountered
earlier in Fig. 10 and, in addition, 8 triangles corresponding to the results of
8 experiments (performed on the same subjects as previously), in which the subject
was required to react differently to flashes of each of k lights (experiment β;
in the different experiments, k assumed the values 2, 3, 4, 5 and 8), which were
flashed on the average with an identical frequency p = 1/k, but such that the
frequency of flashes of each light depended substantially on the light flashed
immediately before it (experiment α). In Fig. 12, as previously, the average
reaction time T is shown on the ordinate (obtained from a series of tests carried
out after a prolonged preliminary training of the subjects under the same
conditions of flashing of the individual lights) and the mean conditional entropy
on the abscissa:

$$
\begin{aligned}
H_\alpha(\beta) & = p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) + \dots + p(A_k)H_{A_k}(\beta) \\
& = \frac{1}{k}\bigl[H_{A_1}(\beta) + H_{A_2}(\beta) + \dots + H_{A_k}(\beta)\bigr]
\end{aligned}
$$

(A_1, A_2, ..., A_k being the outcomes of the experiment α). The circumstance
that, in Fig. 12, the triangles are found to fall closely along the same straight
line around which the squares and circles are grouped shows that the conditional
entropy H_α(β) is actually that measure of the amount of uncertainty
which determines the dependence of the mean reaction time of the person on
the conditions of the experiment.

Fig. 12.


2.3. The concept of information 


We recall the quantity H(β) characterizing the amount of uncertainty of an
experiment β. When this quantity is 0, the outcome of β is
known beforehand; a large or small value of H(β) implies that the problem
of predicting the result of the experiment is complicated or straightforward,
respectively. Some measurement or observation α, preceding the experiment β,
may narrow down the number of possible outcomes of β and thereby reduce
its uncertainty; thus, the amount of uncertainty of an experiment
consisting of determining the heaviest of three loads is reduced after two of
them have been compared by weighing. In order that the result of the measurement
(observation) α may yield information about the succeeding experiment β,
it is obviously necessary that this result not be known in advance; hence, α can
be considered as an auxiliary experiment, also having several admissible outcomes.
The fact that the realization of α cannot increase the amount of uncertainty
of β is reflected in the fact that the conditional entropy
H_α(β) of β given the occurrence of α is less than (more precisely, not
greater than) the unconditional entropy H(β) of the same experiment. In addition,
if the experiment β does not depend on α, then the realization of α does
not lower the entropy of β, i.e., H_α(β) = H(β); if, however, the result of α
completely predetermines the outcome of β, then the entropy of β reduces to
zero: H_α(β) = 0. Thus, the difference


\[
I(\alpha, \beta) = H(\beta) - H_\alpha(\beta)
\]


indicates to what extent the realization of α lowers the uncertainty of β, i.e., how
much more we know about the outcome of β after carrying out the measurement
(observation) α; this difference is called the amount of information with respect
to the experiment β contained in the experiment α or, briefly, the information
about β contained in α.
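
The definition can be checked numerically on small examples. The following Python sketch is only an illustration: the function names `entropy` and `information` and the three joint distributions are assumptions introduced here, not part of the text. It computes I(α, β) = H(β) − H_α(β) from a table of joint probabilities p(A_i B_j) and covers the two limiting cases just mentioned (independent experiments, and α fully determining β) as well as an intermediate one.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes of probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information(joint):
    """I(alpha, beta) = H(beta) - H_alpha(beta), where joint[i][j] = p(A_i B_j)."""
    p_alpha = [sum(row) for row in joint]
    p_beta = [sum(col) for col in zip(*joint)]
    # Mean conditional entropy: the entropies H_{A_i}(beta) averaged with weights p(A_i).
    h_cond = sum(
        pa * entropy([p / pa for p in row])
        for pa, row in zip(p_alpha, joint)
        if pa > 0
    )
    return entropy(p_beta) - h_cond


# Hypothetical joint distributions, chosen only for illustration.
independent = [[0.25, 0.25], [0.25, 0.25]]  # alpha tells nothing about beta
determined = [[0.5, 0.0], [0.0, 0.5]]       # alpha fixes the outcome of beta
partial = [[0.4, 0.1], [0.1, 0.4]]          # alpha is right 4 times out of 5

for name, table in [("independent", independent),
                    ("determined", determined),
                    ("partial", partial)]:
    print(name, round(information(table), 3))
```

In the independent case the information vanishes, in the fully determined case it equals the whole entropy H(β) = 1 bit, and in the intermediate case it lies strictly between the two.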

We have thus a numerical measure of information, which is extremely fruitful
in many cases. Thus, for example, under the conditions of Problem 18 (pp. 66-67)
it can be stated that the reaction used yields information about the incidence of
the disease in question close to 0.14 − 0.12 = 0.02 (where we have taken as the unit
the information given by a single ‘yes’ or ‘no’ answer to a question for which
we are initially inclined to consider an affirmative and a negative answer
to be equally probable); the number 0.02 also evaluates the usefulness of the
reaction. Other examples of employing the concept of amount of information
will be adduced in Chapters 3 and 4.

The relationship between the concepts of entropy and information in a certain
sense recalls the relationship between the physical concepts of potential
and potential difference. The entropy is an abstract ‘measure of uncertainty’;
the value of this concept to a considerable extent lies in the fact that it enables
us to compute the influence on a specific experiment β of some other experiment
α as the ‘difference of entropies’ I(α, β) = H(β) − H_α(β). Since the concept of
information, related to specific changes in the conditions of experiment β, is, so
to say, ‘more active’ than the concept of entropy, it is expedient, for imparting a
sharper meaning to the entropy, to reduce the latter concept to the
former one. The entropy H(β) of β can also be defined as the information
with respect to β contained in β itself (since the realization of the experiment β
itself obviously completely determines its outcome and, consequently, H_β(β)
= 0), or as the maximum information that can be obtained with respect to β
(‘the total information with respect to β’). Put differently, the entropy H(β) of β
is the information given by the realization of this experiment, i.e., the average
information contained in a single outcome of the experiment β.† These statements,
which will be extensively used in Chapters 3 and 4, understandably have the
same meaning as the ‘measure of uncertainty’: the greater the uncertainty of an
experiment, the larger is the information obtained by determining its outcome.
We further emphasize that the information with respect to β contained in
an experiment α is, by definition, the mean value of the random variable H(β)
− H_{A_i}(β) associated with the individual outcomes A_i of α; hence, it can also be
termed ‘the mean information with respect to β contained in α.’ It may often
happen that our desire to know the outcome of some experiment β motivates
us to perform an auxiliary experiment (measurement, observation) α which
can be selected in a variety of ways; thus, for example, when ascertaining the
heaviest of some system of loads, we can compare the individual loads in different
orders. In this case, it is recommended to start with that experiment α₀
which contains the maximum information with respect to β, because with a different
experiment α it is likely that we shall obtain a smaller decrease in the
amount of uncertainty of β (the entropy H(β)). In reality, however, it is also
possible that by chance the experiment α turns out to be more useful than α₀; in
principle, the outcome A of α₀ may turn out to be so unfortunate that the entropy
H_A(β) is found to be greater than the original entropy H(β). Such a


— 


†We note that the expression for entropy

\[
H(\beta) = -p(B_1)\log p(B_1) - p(B_2)\log p(B_2) - \dots - p(B_l)\log p(B_l)
\]

has the form of the mean value of a random variable taking the values −log p(B_1), −log p(B_2),
..., −log p(B_l) with probabilities p(B_1), p(B_2), ..., p(B_l), respectively (see p. 6). In this
connection, it may be considered that when a definite outcome B_i of our experiment is realized,
we obtain information equal to −log p(B_i). In such a case, if the experiment β has, say,
altogether two possible outcomes B_1 and B_2 with probabilities 0.99 and 0.01, then in realizing
the outcome B_1 we obtain quite a small amount of information, −log 0.99 ≈ 0.014 bits. This
is completely natural; in fact, even prior to the experiment we knew that the outcome
B_1 was almost sure to occur, so that the result of the experiment makes little change in the
information available to us. On the contrary, if the outcome B_2 is realized, then the information
obtained equals −log 0.01 ≈ 6.6 bits, i.e., it is much larger than in the first case. This is
natural, since the information obtained as a result of the experiment is here of much greater
interest (it is the realization of a scarcely expected event). However, in a large number of
repetitions of the experiment we obtain such a large amount of information only rarely. Hence,
the average amount of information contained in a single outcome of the experiment is found
here to be smaller than in the case in which the two outcomes are equally probable. We
further remark that in practical problems we are always interested only in this average amount
of information; the idea of the amount of information related to the individual outcomes of
an experiment is rarely applied.

situation is completely natural, since the random character of the outcomes of β
obviously does not permit us to lay out in advance the shortest route to the result
of this experiment; at most, we can work out and indicate the path which
is probably the shortest; it is precisely this possibility which is offered
by information theory.† The individual quantities H(β) − H_{A_i}(β) do not in fact
constitute characteristics of the experiment β at all, because if the result
A_i of the experiment α is known to us (and α and β are not independent), then
we lose the right to speak of the initial experiment β and have to take into
account those changes in the conditions of this experiment which stem from the
fact that α has the outcome A_i. Thus, H_{A_i}(β) is simply the entropy of some new
experiment to which the experiment β reduces given that the event A_i is realized.
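
The distinction between the mean conditional entropy and the entropies after individual outcomes can be made concrete. The sketch below is a hypothetical illustration (the probabilities and variable names are assumptions chosen for this purpose, not data from the text): the auxiliary experiment α usually settles β completely, but its rare second outcome leaves β more uncertain than it was at the start.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes of probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical numbers, chosen only for illustration.
p_alpha = [0.9, 0.1]          # p(A_1), p(A_2)
beta_given = [
    [1.0, 0.0],               # distribution of beta once A_1 has occurred
    [0.5, 0.5],               # distribution of beta once A_2 has occurred
]

# Unconditional distribution of beta by the equation of total probability.
p_beta = [
    sum(pa * row[j] for pa, row in zip(p_alpha, beta_given))
    for j in range(2)
]

h_beta = entropy(p_beta)                                 # H(beta)
h_each = [entropy(row) for row in beta_given]            # H_{A_1}(beta), H_{A_2}(beta)
h_mean = sum(pa * h for pa, h in zip(p_alpha, h_each))   # H_alpha(beta)

print("H(beta)        =", round(h_beta, 3))
print("H_{A_i}(beta)  =", [round(h, 3) for h in h_each])
print("H_alpha(beta)  =", round(h_mean, 3))
print("I(alpha, beta) =", round(h_beta - h_mean, 3))
```

Here the entropy after the unlucky outcome A_2 exceeds H(β), yet the mean conditional entropy stays below H(β), so the information I(α, β) remains positive.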


Problem 20. Suppose that an experiment β consists of the draw of one ball
from an urn containing 5 black and 10 white balls, and an experiment α_k consists
of the preceding draw of k balls from the same urn (without replacement). Find
the entropy of experiment β and the information about this experiment contained
in the experiments α_1, α_2, α_13, and α_14.

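The entropy of β is obviously given by

\[
H(\beta) = -\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3} \approx 0.92 \text{ bits}.
\]

Furthermore, by the formulas obtained in the solution of Problem 19, we have
(in bits):

\[
\begin{aligned}
I(\alpha_1, \beta) = H(\beta) - H_{\alpha_1}(\beta)
&= -\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3}
 + \frac{1}{3}\left(\frac{2}{7}\log\frac{2}{7} + \frac{5}{7}\log\frac{5}{7}\right) \\
&\quad + \frac{2}{3}\left(\frac{5}{14}\log\frac{5}{14} + \frac{9}{14}\log\frac{9}{14}\right) \approx 0.004;
\end{aligned}
\]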


†We should not form the impression that the methods of information theory never
enable us to obtain an absolute evaluation, say, of the number of auxiliary experiments α
needed for determining the result of a definite experiment β. (By an absolute evaluation we
understand here an evaluation which is not merely most probable but holds with certainty.)
Thus, for instance, if the information I(α, β) equals the entropy H(β) of experiment β, then we
can be certain that with every outcome of the experiment α the result of β is completely
determined. Analogously, if the information I(α, β) is 0, then with every outcome A_i of the
experiment α the entropy H_{A_i}(β) equals the original entropy H(β). In this connection, see
Chapter 3.

\[
\begin{aligned}
I(\alpha_2, \beta) = H(\beta) - H_{\alpha_2}(\beta)
&= -\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3}
 + \frac{2}{21}\left(\frac{3}{13}\log\frac{3}{13} + \frac{10}{13}\log\frac{10}{13}\right) \\
&\quad + \frac{10}{21}\left(\frac{4}{13}\log\frac{4}{13} + \frac{9}{13}\log\frac{9}{13}\right)
 + \frac{3}{7}\left(\frac{5}{13}\log\frac{5}{13} + \frac{8}{13}\log\frac{8}{13}\right) \approx 0.008;
\end{aligned}
\]

\[
I(\alpha_{13}, \beta) = H(\beta) - H_{\alpha_{13}}(\beta)
= -\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3} - \frac{2 \times 5 \times 10}{15 \times 14} \approx 0.44;
\]

and, finally,

\[
I(\alpha_{14}, \beta) = H(\beta) - H_{\alpha_{14}}(\beta) = H(\beta) \quad (\approx 0.92).
\]
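
The four values can be reproduced mechanically. The following Python sketch is an illustration under stated assumptions: the function name `information_from_k_draws` and the constants are introduced here, and the outcomes of α_k are grouped by the number b of black balls among the k drawn, whose probability follows the hypergeometric law.

```python
import math
from math import comb


def entropy(probs):
    """Shannon entropy in bits; outcomes of probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


BLACK, WHITE = 5, 10
TOTAL = BLACK + WHITE


def information_from_k_draws(k):
    """I(alpha_k, beta) in bits; beta is the single draw made right after alpha_k."""
    h_beta = entropy([BLACK / TOTAL, WHITE / TOTAL])
    h_cond = 0.0
    for b in range(max(0, k - WHITE), min(BLACK, k) + 1):
        # Probability that exactly b of the k balls drawn first are black.
        p_outcome = comb(BLACK, b) * comb(WHITE, k - b) / comb(TOTAL, k)
        # Chance of black on the next draw from the TOTAL - k balls that remain.
        p_black = (BLACK - b) / (TOTAL - k)
        h_cond += p_outcome * entropy([p_black, 1 - p_black])
    return h_beta - h_cond


for k in (1, 2, 13, 14):
    print(k, round(information_from_k_draws(k), 3))
```

For k = 14 only one ball is left, every conditional entropy vanishes, and the information equals the full entropy H(β).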


Problem 21. Suppose that the probability that there will or will not be rain at
a certain place on 15 June is 0.4 and 0.6, respectively, and on 15 October it is
0.8 and 0.2, respectively. We assume that a specified method of weather
forecasting for 15 June is found to be correct in 3/5 (resp. 4/5) of those cases in
which rain (resp. no precipitation) is predicted; when applied to the weather on 15
October, this method is found to be correct in 9/10 (resp. 1/2) of those cases in which
rain (resp. no rain) is predicted (the comparatively higher percentage of error in
the latter case is naturally explained by the fact that a low-probability event,
which is more difficult to guess, is predicted). The question is: on which of the
two dates indicated does the forecast yield greater information about the actual
weather?

We denote by β_1 and β_2 the experiments consisting of the determination of the
weather at the place under consideration on 15 June and 15 October. We
assume that each of these experiments has, in all, two outcomes, B (rain) and B̄
(no rain); the corresponding probability tables have the form:

(i) Experiment β_1:

    Outcomes         B      B̄
    Probabilities    0.4    0.6

(ii) Experiment β_2:

    Outcomes         B      B̄
    Probabilities    0.8    0.2

Consequently, the entropies of experiments β_1 and β_2 are given by

\[
\begin{aligned}
H(\beta_1) &= -0.4\log 0.4 - 0.6\log 0.6 \approx 0.97 \text{ bits},\\
H(\beta_2) &= -0.8\log 0.8 - 0.2\log 0.2 \approx 0.72 \text{ bits}.
\end{aligned}
\]


Now, let α_1 and α_2 be the forecasts of weather on 15 June and 15 October.
The experiments α_1 and α_2 also have two outcomes each: A (forecast of rain) and
Ā (forecast of dry weather); in addition, the pairs of experiments (α_1, β_1) and
(α_2, β_2) are characterized by the accompanying conditional probability tables:

(i) Pair (α_1, β_1):

    p_A^(1)(B)    p_A^(1)(B̄)    p_Ā^(1)(B)    p_Ā^(1)(B̄)
    0.6           0.4           0.2           0.8

(ii) Pair (α_2, β_2):

    p_A^(2)(B)    p_A^(2)(B̄)    p_Ā^(2)(B)    p_Ā^(2)(B̄)
    0.9           0.1           0.5           0.5

(we recall that p_A(B) + p_A(B̄) = p_Ā(B) + p_Ā(B̄) = 1). These tables enable us
to determine also the unknown probabilities p_1(A) and p_1(Ā), p_2(A) and p_2(Ā)
of the outcomes A and Ā of experiments α_1 and α_2. In fact, by the equation of
total probability (see p. 23), we have for the experiment β_1

\[
0.4 = p(B) = p_1(A)\,p_A^{(1)}(B) + p_1(\bar{A})\,p_{\bar{A}}^{(1)}(B) = 0.6\,p_1(A) + 0.2\,p_1(\bar{A}),
\]

and for the experiment β_2

\[
0.8 = p(B) = p_2(A)\,p_A^{(2)}(B) + p_2(\bar{A})\,p_{\bar{A}}^{(2)}(B) = 0.9\,p_2(A) + 0.5\,p_2(\bar{A}).
\]

Since p_1(Ā) = 1 − p_1(A) and p_2(Ā) = 1 − p_2(A), we obtain

\[
p_1(A) = p_1(\bar{A}) = 0.5, \qquad p_2(A) = 0.75, \qquad p_2(\bar{A}) = 0.25.
\]
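We now calculate the entropies H_A(β_1), H_Ā(β_1), H_A(β_2) and H_Ā(β_2) (in bits):

\[
\begin{aligned}
H_A(\beta_1) &= -0.6\log 0.6 - 0.4\log 0.4 \approx 0.97,\\
H_{\bar{A}}(\beta_1) &= -0.2\log 0.2 - 0.8\log 0.8 \approx 0.72;
\end{aligned}
\]

and

\[
\begin{aligned}
H_A(\beta_2) &= -0.9\log 0.9 - 0.1\log 0.1 \approx 0.47,\\
H_{\bar{A}}(\beta_2) &= -0.5\log 0.5 - 0.5\log 0.5 = 1.
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
H_{\alpha_1}(\beta_1) &= p_1(A)\,H_A(\beta_1) + p_1(\bar{A})\,H_{\bar{A}}(\beta_1) \approx 0.84,\\
H_{\alpha_2}(\beta_2) &= p_2(A)\,H_A(\beta_2) + p_2(\bar{A})\,H_{\bar{A}}(\beta_2) \approx 0.60.
\end{aligned}
\]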


It is thus seen that the information contained in the weather forecast for 15
June (experiment α_1) concerning the actual weather on this date (experiment
β_1) is given by

\[
I(\alpha_1, \beta_1) = H(\beta_1) - H_{\alpha_1}(\beta_1) \approx 0.97 - 0.84 = 0.13 \text{ bits}.
\]

This is slightly greater than the information concerning the actual weather on
15 October (experiment β_2) contained in the forecast of weather for
this date (experiment α_2):

\[
I(\alpha_2, \beta_2) = H(\beta_2) - H_{\alpha_2}(\beta_2) \approx 0.72 - 0.60 = 0.12 \text{ bits}.
\]

This result enables us to consider the forecast of weather for 15 June to be
of greater value than the one for 15 October, despite the fact that the latter forecast
more frequently turns out to be correct; indeed, by the equation of total
probability, the probability that the weather forecast for 15 June is found
correct is

\[
p_1(A)\,p_A^{(1)}(B) + p_1(\bar{A})\,p_{\bar{A}}^{(1)}(\bar{B}) = 0.5 \times 0.6 + 0.5 \times 0.8 = 0.7,
\]

whereas, for the weather forecast for 15 October, this probability is given by

\[
p_2(A)\,p_A^{(2)}(B) + p_2(\bar{A})\,p_{\bar{A}}^{(2)}(\bar{B}) = 0.75 \times 0.9 + 0.25 \times 0.5 = 0.8.
\]
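
The whole solution of Problem 21 can be retraced in a few lines of Python. The sketch below is illustrative only; the function name `forecast_value` and its parameter names are assumptions introduced here. It recovers the probability of a rain forecast from the equation of total probability and then evaluates both the information in the forecast and the probability that the forecast is correct. Carried through without intermediate rounding, the computation gives roughly 0.125 and 0.120 bits, so the ordering of the two forecasts is the same as with the rounded values above.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes of probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def forecast_value(p_rain, hit_rain, hit_dry):
    """Return (information in bits, probability that the forecast is correct).

    p_rain   -- unconditional probability of rain, p(B)
    hit_rain -- probability of rain given that rain was forecast, p_A(B)
    hit_dry  -- probability of no rain given that no rain was forecast
    """
    miss_dry = 1 - hit_dry  # probability of rain although dry weather was forecast
    # Total probability: p(B) = p(A) * hit_rain + (1 - p(A)) * miss_dry; solve for p(A).
    p_forecast_rain = (p_rain - miss_dry) / (hit_rain - miss_dry)
    h_cond = (p_forecast_rain * entropy([hit_rain, 1 - hit_rain])
              + (1 - p_forecast_rain) * entropy([hit_dry, miss_dry]))
    info = entropy([p_rain, 1 - p_rain]) - h_cond
    accuracy = p_forecast_rain * hit_rain + (1 - p_forecast_rain) * hit_dry
    return info, accuracy


june = forecast_value(0.4, 0.6, 0.8)
october = forecast_value(0.8, 0.9, 0.5)
print("15 June:    information %.4f bits, correct with probability %.2f" % june)
print("15 October: information %.4f bits, correct with probability %.2f" % october)
```

The June forecast carries more information although it is correct less often, which is exactly the point of the problem.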


In general, the amount of information I(α, β) contained in the forecast α about
the outcome of some experiment (or random event) β is an objective characteristic
of the value of the forecast. It is zero if H_α(β) = H(β), i.e., if α and β
are independent (so that the ‘forecast’ α is in no way associated with the
event β), or if H(β) = 0 (so that the outcome of β is known in advance and need not
be forecast); in all the remaining cases, the amount of information is positive,
but is not greater than the amount of uncertainty H(β) of the experiment β.
(Moreover, I(α, β) = H(β) only if H_α(β) = 0, i.e., if the ‘forecast’ α uniquely
determines the outcome of β.) However, we note that the universality of the
considered method of evaluating the quality of any forecast implies that this
method cannot cover all aspects of the question. In particular, our estimate of
the forecast completely disregards the content (meaning) of the various outcomes
of the experiment β and rests only on the probabilities of these outcomes.
However, it is quite possible in practical life that, owing to the distinct
character of the different outcomes of β, the correct prediction of one of them is
more crucially important than that of the others. Thus, when forecasting any natural
calamity B (an earthquake, a flood, or even a less hazardous variant, a frost), it is
usually of utmost importance that no error be committed in predicting that B
will not occur, whereas an error in forecasting the occurrence of B is most often
considerably less grave (it implies only taking unfounded precautionary measures).
Such differences among the outcomes of an experiment β have to be
taken care of by other numerical characteristics, different from the information I.

In this connection, we may reiterate with respect to the information I what we
stated above (see pp. 54-55) with regard to the entropy H. The concept of information
first arose directly from the needs of communication theory and is especially
oriented to meet the demands of this theory. Since the transmission of a
message of a specified length over a communication channel (for example, in
telegraphy) entails roughly the same amount of time and cost both in the case
of completely trivial or even false information and in the case of information
about a scientific discovery of far-reaching importance, from the viewpoint
of communication theory we have to consider the amount of information
in these messages to be identical as well. Obviously, such a definition of the amount
of information, which completely disregards the meaning of its content, may not
be appropriate in all the cases in which the term ‘information’ is employed in
everyday life. It is, however, plain that the value of any scientific concept
is determined not by the number of cases it is unable to serve, but mainly by
the importance and prevalence of the concrete problems in the solution of which it
is found to be fruitful. In relation to the concept of information, such problems
are numerous (see, in particular, Chapters 3 and 4).


Problem 22. Suppose that an experiment β consists of determining the position
of some point M, about which we know beforehand only that it lies on a segment
AB of length L (Fig. 13). Let us also suppose that an experiment α consists
of measuring the length of the segment AM by means of some measuring instrument,
which gives us the value of the length to within a definite ‘measurement error’ Δ (say,
by means of a scale marked with divisions of length Δ). What information about
the true position of the point M is contained in the result of the measurement?

Fig. 13.

A cursory inspection reveals that this problem cannot be solved with the aid
of equations derived above. This is so because at the root of these equations
there has always been an experiment that can have only a finite number of outcomes,
but here the experiment β has infinitely many outcomes (the point M can
coincide with any point of the segment AB). And, indeed, we cannot assign
any finite entropy to the experiment β here. Nevertheless, it is found that the
information I(α, β) (which is the difference of the two entropies H(β) and H_α(β)) has,
in the case under consideration, a completely defined finite value. In order to
clarify this, we first assume that the lengths L and Δ are commensurable with
each other and split the entire segment AB into small segments of length ε, so
chosen that an integral number of such small segments fits both on the entire
segment AB and on a segment of length Δ (i.e., such that both the ratios L/ε
and Δ/ε are integers). We now tackle the problem of determining
the position of the point M to within ε. Since we know beforehand
only that M is placed somewhere on the segment AB, we can consider
that the experiment β_ε, consisting of ‘determining the position of M to within ε’,
has L/ε equally probable outcomes. Hence its entropy is H(β_ε) = log (L/ε).
Moreover, after carrying out the experiment α, i.e., measuring the length AM
using our measuring instrument, it becomes clear to us that the point M actually
lies inside a small interval of length Δ, which determines the accuracy of the
measurement. Hence, when the outcome of experiment α is known, the experiment β_ε has
in all Δ/ε equally probable outcomes and, therefore, H_α(β_ε) = log (Δ/ε).
Consequently,

\[
I(\alpha, \beta_\varepsilon) = H(\beta_\varepsilon) - H_\alpha(\beta_\varepsilon)
= \log\frac{L}{\varepsilon} - \log\frac{\Delta}{\varepsilon} = \log\frac{L}{\Delta}.
\]


For indefinitely decreasing ε (i.e., for determining the position of our point
with increasing accuracy), both the entropies H(β_ε) and H_α(β_ε) increase without bound;
however, the information I(α, β_ε) does not change, always remaining equal to
log (L/Δ). It is, therefore, natural that the information I(α, β) (which we define,
say, as the limit of I(α, β_ε) as ε → 0) should also be considered to be equal to
log (L/Δ). This number gives us the information about the true position
of M contained in the result of a measurement to within Δ. For indefinitely
increasing accuracy of the instrument (i.e., for indefinitely decreasing Δ), this
information increases without bound, though this increase is comparatively slow: for
an n-fold increase in accuracy, we obtain in addition only log n units of
information (for example, when the accuracy is doubled, we gain 1 bit of
information and when it increases 1000 times, the information gained is less than
10 bits).
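
The cancellation of ε is easy to observe numerically. In the following Python sketch the function name `measurement_information` and the values L = 1 and Δ = 1/8 are assumptions chosen for illustration (powers of two keep both ratios L/ε and Δ/ε whole numbers, as the argument requires).

```python
import math


def measurement_information(length, delta, eps):
    """Return (H(beta_eps), H_alpha(beta_eps), I(alpha, beta_eps)) in bits.

    Assumes, as in the text, that length/eps and delta/eps are whole numbers.
    """
    cells_total = round(length / eps)  # equally probable positions before measuring
    cells_left = round(delta / eps)    # positions still possible after measuring
    h_before = math.log2(cells_total)
    h_after = math.log2(cells_left)
    return h_before, h_after, h_before - h_after


LENGTH, DELTA = 1.0, 0.125             # hypothetical values, so log2(L / DELTA) = 3
results = [measurement_information(LENGTH, DELTA, 2.0 ** -m) for m in (3, 6, 10, 20)]

for h_before, h_after, info in results:
    print("H = %5.1f   H_alpha = %5.1f   I = %.1f" % (h_before, h_after, info))

# An n-fold gain in accuracy adds only log2(n) bits.
print("gain for n = 2:   ", math.log2(2))
print("gain for n = 1000:", round(math.log2(1000), 2))
```

Both entropies grow as ε shrinks, while their difference stays fixed at log (L/Δ) = 3 bits.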

In our reasoning, the lengths L and Δ are assumed to be commensurable. It
is, however, obvious that this assumption is not essential: if ε is chosen sufficiently
small, then the assumption that an integral number of small segments of
length ε fits on the segment AB and on a segment of length Δ is always satisfied to
great accuracy, so that the conclusion obtained by us remains valid even in
case L and Δ are incommensurable.

Here we have merely touched upon the problem of the information contained in
the result of a measurement. For a more detailed discussion, the reader is referred
to Brillouin [5].

We further remark that in solving Problem 22 we encountered a rather unusual
situation. We had to deal there with an experiment β having an infinite
number of outcomes, so that we have to consider the corresponding entropy H(β)
to be infinite. For computing the information about this experiment contained in
another experiment α, we considered an auxiliary experiment β_ε, obtained by
combining together every group of outcomes of β differing from each other
by no more than a small value ε. It was found that both the entropy
H(β_ε) of this new experiment and the conditional entropy H_α(β_ε) have a finite
value; furthermore, since their difference was found to be independent of the
choice of ε, we agreed to take this difference as the information I(α, β).

A similar situation recurs whenever we consider an experiment
β having a continuous set of outcomes. In all such cases, the entropy
H(β) is infinite. However, in place of it we may often consider a finite entropy
H(β_ε) = H_ε(β), obtained by combining together all outcomes of β differing by not
more than some small ε. In practical problems, the entropy H_ε(β) (called
the ε-entropy of an experiment β) represents a quite reasonable quantity, since
we cannot, in general, distinguish the outcomes of β that differ from each other
by less than a definite very small value. This value is determined by the limiting
accuracy of the measuring instrument at our disposal. We shall take up this
problem again later (see Chapter 4.3).


Equating the entropy H(α) to the average information contained in the outcome
of an experiment α, we can, in particular, give a new interpretation to
the psychological experiments described on pp. 56-59 and 72-73. It is now seen
that, according to these results, the mean time required for an accurate understanding
of the meaning of some signal, and for the proper reaction to it, increases in
proportion to the mean information contained in this signal. It is natural to
assume on this basis that in the case in which the events occur with sufficient
regularity, in other words, are characterized by a definite statistical law (i.e., are
random events in the strict sense of probability theory), the information on
the occurrence of such an event is conveyed by the sensory organs and nervous
system over a time that is on the average proportional to the amount of information
contained in this event. Hence, it can be assumed that the transmission of
information in living organisms is characterized in many cases by the following
property: the same amount of information is transmitted on the average over the
same period of time. It is worth noting here that, as we shall see from the
contents of Chapter 4, such a property holds also in the transmission of information
over all engineering communication lines.
There is a simple consequence of the assumption made, and it can be
verified experimentally. Let us presume that, while carrying out an experiment to
determine the average reaction time, the subject is forced to react quite fast, so
fast that he is unable to comprehend fully precisely which signal appeared
before him. For example, let the signal consist of the flashing of
one of n lights and let it be required that the ith knob be pressed when the ith light is
flashed. If the subject is forced to decrease the reaction time T, he naturally errs
with greater frequency, pressing some other knob, say the jth one, in place of the
ith. This means that, because of the compulsion to react very fast, he
is not in a position to absorb fully all the information contained in the appearance
of a specific signal. If, however, T is not too small, then the subject is able
to grasp some useful information about the signal. This is manifested by the
fact that his reactions are not completely disorderly; rather, on the average,
he presses the ith knob more often than any other one when the ith light is
flashed. An experiment α consisting of the pressing of one of the n knobs by the
subject contains here definite information about an experiment β consisting of the
flashing of one of the n lights. This information I(α, β) is obviously the average
information about β that the subject is able to comprehend in time T.
According to our assumption, the dependence of this information on the reaction
time T must be the same as that of the entropy H(β) on T for the case in
which T is defined as the least time sufficient for an error-free reaction.

Fig. 14.

The last conclusion has been verified by the British psychologist W. E. Hick
[46]. The results obtained by him are plotted in Fig. 14. The small circles
denote here the average time determined from experiments of the same kind
as described on pp. 56-58. Specifically, n distinct lights are flashed with equal
frequency before the subject (who is the investigator himself in the given case),
n ranging from 1 to 10 in different experiments, and the average time T required
for a correct reaction to the incoming signal is measured. As known already,
T increases here linearly with the growth of the entropy H(β) = I(β, β). This
is manifested by the fact that in Fig. 14, where T is plotted on the ordinate and
H(β) = I(β, β) on the abscissa, all circles fall quite accurately along a single
straight line. The crosses in Fig. 14 plot the results of an experiment in which
all 10 lights are used and flashed at the same frequency, but the reaction time T
is fixed beforehand to be so small that the reaction of the subject in a number of
cases is necessarily faulty. In order to evaluate the average information
contained in an experiment α (the pressing of one of the 10 knobs by the
subject) about an experiment β (the emergence of one of the 10 signals), a large
series of N experiments was carried out with one and the same T, and the number
n_{i,j} was calculated, being the total number of those cases in which the jth knob was
pressed in response to the flash of the ith lamp. Here i and j take all possible
values from 1 to 10 and the sum of all n_{i,j} is equal to N,
while the total number of cases in which the subject
reacted correctly is given by n_{1,1} + n_{2,2} + ... + n_{10,10}. It is clear that the
probabilities of the 10 outcomes of experiment β may be considered here to be given
approximately by the frequencies


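\[
q_1 = \frac{n_{1,1} + n_{1,2} + \dots + n_{1,10}}{N}, \quad
q_2 = \frac{n_{2,1} + n_{2,2} + \dots + n_{2,10}}{N}, \quad \dots, \quad
q_{10} = \frac{n_{10,1} + n_{10,2} + \dots + n_{10,10}}{N},
\]

and the probabilities of the 10 outcomes of experiment α by the frequencies

\[
p_1 = \frac{n_{1,1} + n_{2,1} + \dots + n_{10,1}}{N}, \quad
p_2 = \frac{n_{1,2} + n_{2,2} + \dots + n_{10,2}}{N}, \quad \dots, \quad
p_{10} = \frac{n_{1,10} + n_{2,10} + \dots + n_{10,10}}{N}.
\]

The compound experiment αβ has here 10² = 100 different outcomes, whose
probabilities are approximately equal to the following frequencies:

\[
p_{1,1} = \frac{n_{1,1}}{N}, \quad p_{1,2} = \frac{n_{1,2}}{N}, \quad \dots, \quad p_{10,10} = \frac{n_{10,10}}{N}.
\]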

This yields the following expressions for the entropies of experiments β, α and
αβ:

\[
\begin{aligned}
H(\beta) &= -q_1\log q_1 - q_2\log q_2 - \dots - q_{10}\log q_{10},\\
H(\alpha) &= -p_1\log p_1 - p_2\log p_2 - \dots - p_{10}\log p_{10},\\
H(\alpha\beta) &= -p_{1,1}\log p_{1,1} - p_{1,2}\log p_{1,2} - \dots - p_{10,10}\log p_{10,10},
\end{aligned}
\]

which permit us to calculate these entropies approximately from the experimentally
determined numbers n_{i,j}. Then, from the formula

\[
H(\alpha\beta) = H(\alpha) + H_\alpha(\beta)
\]

(see p. 63) the mean conditional entropy H_α(β) can be found as

\[
H_\alpha(\beta) = H(\alpha\beta) - H(\alpha).
\]

Moreover, from H(β) and H_α(β) we can also determine the information I(α, β)
about the experiment β contained in the experiment α:

\[
I(\alpha, \beta) = H(\beta) - H_\alpha(\beta).
\]

This value of I(α, β) is used as the abscissa of the crosses in Fig. 14.
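
The procedure just described translates directly into a short computation on the table of counts n_{i,j}. The Python sketch below is an illustration only: the function name `information_from_counts` and the 3 × 3 table of counts are hypothetical (Hick's own data are not reproduced here, and his table was 10 × 10).

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes of probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_from_counts(counts):
    """Estimate I(alpha, beta) in bits from counts[i][j], the number of trials
    in which knob j was pressed in response to light i."""
    total = sum(sum(row) for row in counts)
    q = [sum(row) / total for row in counts]            # lights: experiment beta
    p = [sum(col) / total for col in zip(*counts)]      # knobs: experiment alpha
    joint = [n / total for row in counts for n in row]  # compound experiment
    h_cond = entropy(joint) - entropy(p)                # H_alpha(beta)
    return entropy(q) - h_cond


# Hypothetical counts for 3 lights and 3 knobs, chosen only for illustration.
counts = [
    [8, 1, 1],
    [1, 8, 1],
    [1, 1, 8],
]
print(round(information_from_counts(counts), 3), "bits out of",
      round(math.log2(3), 3), "available")
```

With no errors at all the estimate equals the full entropy log 3 of the three equally frequent lights, and with completely random pressing it drops to zero; the hypothetical table above lies between these extremes.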

We note that the setting of the experiment here is in a certain sense converse
to that considered on pp. 56-59 and 72-73. Earlier, the information I(β, β) =
H(β) was specified beforehand and we investigated the dependence of the reaction
time T on it. Now, by contrast, T is specified beforehand (i.e., the subject is
required to react within a definite time T after the emergence of the signal) and
the dependence of the information I(α, β) on T is studied. The circumstance that
the crosses in Fig. 14 cluster around the same straight line as the circles
supports the surmise that the reaction time depends linearly precisely on the
information contained in a signal.

There is obviously no justification for extending the results of these few experiments,
carried out in a highly specific setup, to all general processes of the
transmission of information in a living organism. In fact, a simple linear dependence
between the reaction time and the information contained in a given signal
is not observed in all experiments. Besides, even in those cases in which such a
dependence holds, the coefficients of the corresponding linear functions may take
highly different values (see, for example, Fig. 15 taken from Nikolaev [51]; also,
see [52] and the book [50] containing more than 500 references). The factors
on which these coefficients depend have been studied by many authors (see, for
example, the review papers [48] and [49]); however, a large number of open
questions still remain in this field. Nonetheless, the available data (see the references
cited above and also [42] and [19]) show positively that the quantitative
concept of information can often be used successfully to give a mathematical
description of the processes of perception and assimilation by living organisms
of the various sorts of signals that are transmitted to them from the
external world.


We shall now show that the information with respect to an experiment β contained
in an experiment α is always the same as the information with respect to α
contained in β. This is immediate from the equations of the preceding section: since

\[
H(\alpha) + H_\alpha(\beta) = H(\beta) + H_\beta(\alpha)
\]

(see p. 66), it follows that

\[
I(\alpha, \beta) = H(\beta) - H_\alpha(\beta) = H(\alpha) - H_\beta(\alpha) = I(\beta, \alpha).
\]

Thus, the information I(α, β) that α contains with respect to β can also be called
the reciprocal (mutual) information of the two experiments α and β with respect to each other.

The equality between the information I(α, β) and the information I(β, α) is emphasized
by the following simple equation, which is found to be extremely convenient in
many cases:

\[
I(\alpha, \beta) = H(\alpha) + H(\beta) - H(\alpha\beta)
\]

(see, for example, p. 85). This equation stems from the fact that H_α(β) =
H(αβ) − H(α) (because H(αβ) = H(α) + H_α(β)); the experiments α and β entering
the right-hand side of this equation are completely symmetric.

Fig. 15. (Axis label: reaction time; legend entry 1: Hick's experiment.)

The symmetric equation derived here for the amount of information can also
be usefully transformed. This transformation simplifies its right-hand side, so 
that it can be expressed directly in terms of the probabilities p(A;), ..., p(Ax), 
P(B,),..., p(B) and p(A,B,), p(A,B,),..., p(A,Bi) of distinct outcomes of 


%, Band«8. In fact, according to the definition of entropy, 
H(«) = —p(A;) log p(Ai) — p(Az) log p(A2) — ... — p(Ae) log plAn), 
H(8) = —p(B,) log p(B,) — p(By) log p(B,) — ... — p(B,) log p(B,). 
and 
H(aB) = —p(A.B,) log p(A,B,) — p(AiB,) log p(AiB,) — ... 
a P(AgB,) log P(A;B,). 
On the other hand, by the addition law of probabilities (see p. 9), 
p(Ai) = p(A;B,) a p(A;:B,) Fe asa fe pP(A;B.), i= l, 2; ees k, 
and 
p(B;) ao P(A,B,) + p(A.B,) +... + P(A, By), Ge Zeeks L 
so that 
—p(A;) log p(Ai) = —p(A:B,) log p(Ai) — p(A:B,) log p(Ay) — ..- 
— p(A,Bi) log p(A:), 
—p(B;) log p(B,) = —p(A,Bi) log p(Bi) — p(A,B,) log p(B,) — ... 
— p(A,B;) log p(B,). 


Substituting all these expressions in the original equation, we get

$$\begin{aligned}
I(\alpha, \beta) = {} & -p(A_1B_1)\,[\log p(A_1) + \log p(B_1) - \log p(A_1B_1)] \\
& - p(A_1B_2)\,[\log p(A_1) + \log p(B_2) - \log p(A_1B_2)] \\
& - \ldots \\
& - p(A_kB_l)\,[\log p(A_k) + \log p(B_l) - \log p(A_kB_l)],
\end{aligned}$$

or, finally,

$$I(\alpha, \beta) = p(A_1B_1) \log \frac{p(A_1B_1)}{p(A_1)\,p(B_1)} + p(A_1B_2) \log \frac{p(A_1B_2)}{p(A_1)\,p(B_2)} + \ldots + p(A_kB_l) \log \frac{p(A_kB_l)}{p(A_k)\,p(B_l)}.$$

This equation also is obviously symmetric in $\alpha$ and $\beta$.
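
As a hedged illustration, the final expression can be compared numerically with the entropy form of the information. The 2 x 3 joint table and the function names below are hypothetical, and base-2 logarithms are an arbitrary choice.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def information_from_ratios(joint):
    """Sum of p(A_i B_j) * log[p(A_i B_j) / (p(A_i) p(B_j))] over all cells."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # a zero cell contributes nothing to the sum
                total += p * log2(p / (p_a[i] * p_b[j]))
    return total

def information_from_entropies(joint):
    """H(alpha) + H(beta) - H(alpha beta) for the same table."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return entropy(p_a) + entropy(p_b) - entropy([p for row in joint for p in row])

joint = [[0.20, 0.05, 0.05],
         [0.10, 0.30, 0.30]]   # hypothetical table; rows are outcomes of alpha

by_ratios = information_from_ratios(joint)
by_entropies = information_from_entropies(joint)
print(round(by_ratios, 6), round(by_entropies, 6))
```

The two values coincide up to rounding error, which is what the substitution in the text asserts.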
The equation

$$I(\alpha, \beta) = I(\beta, \alpha)$$

can also be written in the following form:

$$I(\alpha, \beta) = H(\alpha) - H_\beta(\alpha).$$


From this it follows that the information $I(\alpha, \beta)$ contained in an experiment
$\alpha$ with respect to an experiment $\beta$ does not exceed the entropy $H(\alpha)$ of $\alpha$,
a fact that is often found useful. This result obviously cannot be considered
unexpected. It is natural that the information that $\alpha$ contains about
another experiment $\beta$ does not exceed the information contained in $\alpha$ with respect
to itself, i.e., the entropy $H(\alpha)$ of this experiment. Thus, the entropy $H(\alpha)$ can also
be defined as the maximum information which can be contained in an experiment
$\alpha$ (the ‘total information’ contained in $\alpha$).

From the formula $I(\alpha, \beta) = H(\alpha) - H_\beta(\alpha)$ it also follows that the information
$I(\alpha, \beta)$ is precisely equal to the entropy $H(\alpha)$ of $\alpha$ if and only if the conditional
entropy $H_\beta(\alpha)$ is 0, i.e., if the result of experiment $\beta$ completely determines the
outcome of the auxiliary experiment $\alpha$. The position will be precisely so, for
instance, in the problems analyzed in the next chapter. If, however, $H_\beta(\alpha) \neq 0$,
then the information $I(\alpha, \beta)$ equals the entropy $H(\alpha)$ minus the value $H_\beta(\alpha)$. In
particular, if the experiments $\alpha$ and $\beta$ are independent (and only in that case),
$I(\alpha, \beta)$ is 0.

We further note that, if the conditional entropy $H_\beta(\alpha)$ is 0 and, consequently,
the information $I(\alpha, \beta)$ with respect to $\beta$, contained in $\alpha$, is the maximum (i.e.,
the experiment $\alpha$ does not contain more information about any other experiment
$\beta_1$), then the information with respect to every experiment $\gamma$ independent of
$\beta$, contained in $\alpha$, is 0. This provides the justification to say that the experiment
$\alpha$ is ‘directed straight’ at elucidating the outcome of $\beta$ and does not contain any
‘extraneous’ information. In the general case, however, the information with
respect to any experiment $\gamma$ independent of $\beta$, contained in $\alpha$, does not exceed the
quantity $H_\beta(\alpha) = I(\alpha, \alpha) - I(\alpha, \beta)$ (if $H_\beta(\alpha) = 0$, then this statement converts
into the more particular result indicated above). The proof of this statement
demands the introduction of an important auxiliary concept; it will be
adduced (together with the proofs of other statements formulated below) at the
end of the present section.

We now suppose that $\alpha$, $\beta$ and $\gamma$ are three arbitrary experiments. In such a
case, we always have

$$I(\beta\gamma, \alpha) \geq I(\beta, \alpha);$$

in other words, the information contained in a compound experiment $\beta\gamma$ (i.e., a
pair of experiments $\beta$ and $\gamma$) with respect to every experiment $\alpha$ is never less than
that contained in a simple experiment $\beta$. This fact is completely natural from
the viewpoint of our heuristic notions on ‘information’; a rigorous proof of this
and similar propositions provides a justification for the use of the term ‘information’
in relation to the quantity $I(\alpha, \beta)$. In addition, the equality $I(\beta\gamma, \alpha) = I(\beta, \alpha)$
holds if and only if the conditional probability of any outcome of $\alpha$, given
that $\beta$ and $\gamma$ have certain definite outcomes, remains invariant for a change in
the outcome of $\gamma$ (i.e., it depends only on the outcome of $\beta$). In the latter case,
it is quite natural to consider that the compound experiment $\beta\gamma$ contains no
additional information with respect to $\alpha$ in comparison with the experiment $\beta$, so
that the equality $I(\beta\gamma, \alpha) = I(\beta, \alpha)$ is also in full agreement here with the
intuitive meaning of the concept of ‘information.’


Let us now assume that the equality $I(\beta\gamma, \alpha) = I(\beta, \alpha)$ holds. It can be shown
that, in this case, we always have

$$I(\gamma, \alpha) \leq I(\beta, \alpha).$$

Thus, if the compound experiment $\beta\gamma$ contains no additional information about $\alpha$
in comparison with the experiment $\beta$, then the information about $\alpha$ contained in $\gamma$
cannot be greater than that contained in $\beta$. In addition, the ‘less than or equal
to’ sign in the last inequality can be replaced by the ‘equality’ sign if and only
if $I(\beta\gamma, \alpha) = I(\gamma, \alpha)$, i.e., if the compound experiment $\beta\gamma$ does not contain
additional information about $\alpha$ also in comparison with the experiment $\gamma$.

The inequality $I(\gamma, \alpha) \leq I(\beta, \alpha)$, referred to above, plays a significant role in
information theory (see, for example, [14] and [44] as well as Chapter 4 of this
book). It says that in successive transmissions of information about an experiment
$\alpha$ realized by a chain of experiments $\beta, \gamma, \delta, \ldots$, where only $\beta$ is directly
associated with $\alpha$ but $\gamma$ receives all of its information about $\alpha$ from
its association with $\beta$ (so that $\beta\gamma$ contains no additional information about $\alpha$ as
compared with $\beta$), $\delta$ receives all of its information about $\alpha$ from its association
with the experiment $\gamma$ and so on, the information about $\alpha$ can only be
reduced:

$$H(\alpha) = I(\alpha, \alpha) \geq I(\beta, \alpha) \geq I(\gamma, \alpha) \geq I(\delta, \alpha) \geq \ldots .$$


As an auditory illustration of this situation, we can consider the well-known
children’s game of ‘garbled telephone’. In this game, the first player utters
quietly into the ear of his immediate neighbour some word (the experiment $\alpha$),
the neighbour quietly conveys the word heard by him (which may differ
from the one pronounced originally) to the next player (the experiment $\beta$), this
player also conveys the word heard by him to his immediate neighbour (the
experiment $\gamma$), and so on. At the close of the game, each player tells what word
he heard, and the participant who was first to hear incorrectly
the word conveyed to him is regarded as the loser. In this game, it may so
happen that the second player conveys the originally spoken word incorrectly
but the third, in consequence of another error, says that he heard the same word
as that conveyed in the beginning; however, when this procedure is repeated a
large number of times, the second player certainly conveys the word uttered by
the first player on the average more often than the third player. But our concept
of information $I$ is precisely a statistical concept, characterizing relations
that hold ‘on the average’; hence the whole string of inequalities set forth
above is always satisfied. It is clear that, from the viewpoint of the intuitive
notions on the transmission of information, this situation can also be considered
as obvious.
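
The chain of inequalities can be illustrated with a deliberately simplified model of the game. The sketch below assumes that the ‘word’ is a single equiprobable binary symbol and that every player independently mishears it with the same probability; both assumptions, the function names and the numbers are illustrative only and are not part of the text.

```python
from math import log2

def binary_entropy(e):
    """Entropy in bits of an experiment with two outcomes of probability e and 1 - e."""
    return 0.0 if e in (0.0, 1.0) else -e * log2(e) - (1 - e) * log2(1 - e)

def chain_information(flip_prob, players):
    """Information about the first player's symbol held by each later player.

    Assumption: each player flips the symbol with probability flip_prob,
    so the accumulated error probability evolves as e -> e(1 - f) + (1 - e)f.
    """
    infos, err = [], 0.0
    for _ in range(players):
        err = err * (1 - flip_prob) + (1 - err) * flip_prob
        # H(alpha) = 1 bit; the conditional entropy equals the entropy of the error.
        infos.append(1.0 - binary_entropy(err))
    return infos

infos = chain_information(flip_prob=0.1, players=4)
print([round(x, 4) for x in infos])
```

Under these assumptions the printed values never increase along the chain, in agreement with the string of inequalities above.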
The inequalities

$$I(\beta\gamma, \alpha) \geq I(\beta, \alpha) \quad \text{and} \quad I(\beta\gamma, \alpha) \geq I(\gamma, \alpha)$$

(see p. 88) can be augmented by one more inequality that is somewhat less obvious from the
viewpoint of the intuitively expected properties of the quantity given the name ‘information.’
It is clear that, in general, it is completely plausible for the inequality

$$I(\beta\gamma, \alpha) < I(\beta, \alpha) + I(\gamma, \alpha)$$

to hold. In fact, if, say, $\beta = \gamma$, then also $\beta\gamma = \beta$, and hence usually in such a case

$$I(\beta\gamma, \alpha) = I(\beta, \alpha) < I(\beta, \alpha) + I(\gamma, \alpha) = 2I(\beta, \alpha).$$

If, however, the experiments $\beta$ and $\gamma$ are independent (i.e., $I(\beta, \gamma) = I(\gamma, \beta) = 0$), then the
inequality $I(\beta\gamma, \alpha) < I(\beta, \alpha) + I(\gamma, \alpha)$ is impossible; in this case, we necessarily have

$$I(\beta\gamma, \alpha) \geq I(\beta, \alpha) + I(\gamma, \alpha).$$


The inequality $I(\beta, \alpha) + I(\gamma, \alpha) > I(\beta\gamma, \alpha)$ being impossible here is explained by the fact
that the independence of experiments $\beta$ and $\gamma$ guarantees the absence of a ‘common portion’
of the quantities $I(\beta, \alpha)$ and $I(\gamma, \alpha)$. In fact, here experiments $\beta$ and $\gamma$ supply substantially
different information about the experiment $\alpha$ and therefore the information $I(\beta\gamma, \alpha)$ associated
with the simultaneous realization of both the experiments $\beta$ and $\gamma$ cannot be less than the sum
of $I(\beta, \alpha)$ and $I(\gamma, \alpha)$. This can be compared with the inequality

$$\text{area}\,(F_1 + F_2) < \text{area}\,F_1 + \text{area}\,F_2,$$

where $F_1 + F_2$ is the union of figures $F_1$ and $F_2$. This inequality is obviously impossible if $F_1$
and $F_2$ do not have a common part. However, it seems that here we may expect the equality

$$I(\beta\gamma, \alpha) = I(\beta, \alpha) + I(\gamma, \alpha),$$

because it remains obscure owing to what circumstances $I(\beta\gamma, \alpha)$ can be found to be
greater than the sum of $I(\beta, \alpha)$ and $I(\gamma, \alpha)$.

The matter, however, is that even for the case in which $\beta$ and $\gamma$ are independent, their joint
occurrence (i.e., the experiment $\beta\gamma$), which enables us to know simultaneously the outcomes of
both $\beta$ and $\gamma$, can generally supply more information than that given by the individual
realizations of $\beta$ and $\gamma$ (with which the quantity $I(\beta, \alpha) + I(\gamma, \alpha)$ is associated). This can be illustrated
by the example printed in small type on pp. 25-26. We recall the tetrahedron of Fig. 2 and
suppose that the experiments $\alpha$, $\beta$ and $\gamma$ consist of verifying, respectively, that the digits 1, 2
and 3 are or are not on the side on which the tetrahedron falls. In this case, $\alpha$ can have
the outcomes $A$ and $\bar{A}$, $\beta$ the outcomes $B$ and $\bar{B}$ and $\gamma$ the outcomes $C$ and $\bar{C}$. From the
calculations derived on p. 25 it is immediate that $\alpha$, $\beta$ and $\gamma$ are pairwise independent. Thus, we have

$$I(\beta, \alpha) = 0 \ \text{and} \ I(\gamma, \alpha) = 0, \ \text{so that} \ I(\beta, \alpha) + I(\gamma, \alpha) = 0.$$

On the other hand, the results of the compound experiment $\beta\gamma$ completely determine the
outcome of $\alpha$. (In fact, experiment $\alpha$ has the outcome $A$ if $\beta$ and $\gamma$ have a ‘common’ outcome, i.e.,
both $\beta$ and $\gamma$ have, respectively, the ‘positive’ outcomes $B$ and $C$, or the ‘negative’ outcomes $\bar{B}$
and $\bar{C}$; $\alpha$ has the outcome $\bar{A}$ if $\beta$ and $\gamma$ have ‘different’ outcomes, i.e., $B$ and $\bar{C}$ or $\bar{B}$ and $C$.)
Thus, we have

$$I(\beta\gamma, \alpha) = H(\alpha) = 1 \text{ bit},$$

$$I(\beta\gamma, \alpha) > I(\beta, \alpha) + I(\gamma, \alpha) \ (= 0).$$

Furthermore, here experiments $\beta$ and $\gamma$ contain no information about $\alpha$, but experiment $\beta\gamma$
contains ‘complete’ information about $\alpha$, i.e., the maximum information one can have about $\alpha$.
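
The tetrahedron example can be reproduced numerically. The sketch below assumes a particular labelling of the four equally likely faces (three faces carrying one digit each and a fourth carrying all three); this labelling is an assumption chosen to agree with the description of the outcomes above, since Fig. 2 is not reproduced in this section, and the helper names are hypothetical.

```python
from math import log2
from collections import Counter

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def information(pairs):
    """I(x, y) = H(x) + H(y) - H(xy) for a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    h = lambda items: entropy([c / n for c in Counter(items).values()])
    return h([x for x, _ in pairs]) + h([y for _, y in pairs]) - h(pairs)

# Assumed faces of the tetrahedron, each falling with probability 1/4.
faces = [{1}, {2}, {3}, {1, 2, 3}]
alpha = [1 in f for f in faces]   # is the digit 1 on the face?
beta = [2 in f for f in faces]    # is the digit 2 on the face?
gamma = [3 in f for f in faces]   # is the digit 3 on the face?

i_beta = information(list(zip(beta, alpha)))
i_gamma = information(list(zip(gamma, alpha)))
i_both = information(list(zip(zip(beta, gamma), alpha)))
print(i_beta, i_gamma, i_both)
```

With this labelling each single experiment carries no information about $\alpha$, while the pair carries one full bit.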
The proof of the statements made above can be deduced by studying the quantity

$$I_\beta(\gamma, \alpha) = H_\beta(\alpha) - H_{\beta\gamma}(\alpha),$$

which we call the mean conditional information of two experiments $\gamma$ and $\alpha$ with respect to each
other, given that experiment $\beta$ is realized, or, for short, simply the conditional information of
experiments $\gamma$ and $\alpha$ given $\beta$. We first note that the conditional information $I_\beta(\gamma, \alpha)$ is always
non-negative. This fact straightaway stems from the inequality

$$H_{\beta\gamma}(\alpha) \leq H_\beta(\alpha),$$

signifying that the prior realization of a compound experiment $\beta\gamma$ (i.e., the two experiments $\beta$
and $\gamma$) always reduces the amount of uncertainty of the experiment $\alpha$ to an extent not less than
the realization of a single experiment $\beta$ (for a rigorous proof of this inequality, see Appendix
I at the end). Since, in addition, we always have $H_{\beta\gamma}(\alpha) \geq 0$ (because $H_{\beta\gamma}(\alpha)$ is a
conditional entropy), it follows that

$$0 \leq I_\beta(\gamma, \alpha) \leq H_\beta(\alpha).$$

Moreover, $I_\beta(\gamma, \alpha) = H_\beta(\alpha)$ if and only if $H_{\beta\gamma}(\alpha) = 0$, i.e., if the compound experiment $\beta\gamma$
uniquely determines the outcome of experiment $\alpha$; $I_\beta(\gamma, \alpha) = 0$ if and only if $H_{\beta\gamma}(\alpha) = H_\beta(\alpha)$
and, consequently, also $I(\beta\gamma, \alpha) = I(\beta, \alpha)$, i.e., if the conditional probabilities of all outcomes
of $\alpha$, given that $\beta$ and $\gamma$ have some specific outcomes, do not depend on the outcome of $\gamma$ (see
the end of Appendix I).

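We shall now show that the conditional information has the symmetry property:

$$I_\beta(\gamma, \alpha) = I_\beta(\alpha, \gamma)$$

(this property is emphasized by the very name ‘conditional information of experiments $\gamma$ and
$\alpha$ with respect to each other’). In fact, by definition

$$I_\beta(\gamma, \alpha) = H_\beta(\alpha) - H_{\beta\gamma}(\alpha), \qquad I_\beta(\alpha, \gamma) = H_\beta(\gamma) - H_{\alpha\beta}(\gamma).$$

But the compound experiment $\alpha\beta\gamma$, consisting of the realization of three experiments $\alpha$, $\beta$ and
$\gamma$, can be considered with equal justification to be a union of the joint experiment $\alpha\beta$ and
experiment $\gamma$, or also as the union of $\alpha$ and the joint experiment $\beta\gamma$.† Hence,

$$H(\alpha\beta\gamma) = H(\alpha\beta) + H_{\alpha\beta}(\gamma) = H(\beta) + H_\beta(\alpha) + H_{\alpha\beta}(\gamma),$$

and

$$H(\alpha\beta\gamma) = H(\beta\gamma) + H_{\beta\gamma}(\alpha) = H(\beta) + H_\beta(\gamma) + H_{\beta\gamma}(\alpha).$$

Consequently,

$$H_\beta(\alpha) + H_{\alpha\beta}(\gamma) = H_\beta(\gamma) + H_{\beta\gamma}(\alpha),$$

i.e.,

$$I_\beta(\gamma, \alpha) = H_\beta(\alpha) - H_{\beta\gamma}(\alpha) = H_\beta(\gamma) - H_{\alpha\beta}(\gamma) = I_\beta(\alpha, \gamma).$$

†Symbolically, this can be written as the equation $\alpha\beta\gamma = (\alpha\beta)\gamma = \alpha(\beta\gamma)$
(cf. the ‘associative law’ of multiplication of events on p. 38, Chap. 1.5).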


The equality $I_\beta(\gamma, \alpha) = I_\beta(\alpha, \gamma)$ is also implied by the following ‘symmetric expression’ of
conditional information $I_\beta(\gamma, \alpha)$, which is trivial to verify directly: If $A_i$ (where $i = 1,
2, \ldots, l$), $B_j$ (where $j = 1, 2, \ldots, m$) and $C_k$ (where $k = 1, 2, \ldots, n$) are all possible
outcomes of experiments $\alpha$, $\beta$ and $\gamma$, then

$$I_\beta(\gamma, \alpha) = p(B_1)\, I_{B_1}(\gamma, \alpha) + p(B_2)\, I_{B_2}(\gamma, \alpha) + \ldots + p(B_m)\, I_{B_m}(\gamma, \alpha).$$

Here

$$I_{B_j}(\gamma, \alpha) = p_{B_j}(A_1C_1) \log \frac{p_{B_j}(A_1C_1)}{p_{B_j}(A_1)\, p_{B_j}(C_1)} + \ldots + p_{B_j}(A_lC_n) \log \frac{p_{B_j}(A_lC_n)}{p_{B_j}(A_l)\, p_{B_j}(C_n)}$$

is the reciprocal information of experiments $\alpha$ and $\gamma$, given that experiment $\beta$ has outcome $B_j$.
Such an expression neatly explains the meaning of the conditional information $I_\beta(\gamma, \alpha)$; this
shall not be needed by us, however.

From the equation $I_\beta(\gamma, \alpha) = H_\beta(\alpha) - H_{\beta\gamma}(\alpha)$ it is easy to obtain the important relation

$$I(\beta\gamma, \alpha) = I(\beta, \alpha) + I_\beta(\gamma, \alpha),$$

close in form to the equation $H(\beta\gamma) = H(\beta) + H_\beta(\gamma)$. (The stated relation for $I(\beta\gamma, \alpha)$ stems
from the fact that $I(\beta\gamma, \alpha) = H(\alpha) - H_{\beta\gamma}(\alpha)$ and $I(\beta, \alpha) = H(\alpha) - H_\beta(\alpha)$.) It is obvious that
our assertions concerning the amount of information $I(\beta\gamma, \alpha)$ are automatic consequences of
this relation and the properties of conditional information.

In the sequel, we shall also find fruitful the following triple information equation:

$$I(\beta\gamma, \alpha) + I(\beta, \gamma) = I(\alpha\gamma, \beta) + I(\alpha, \gamma).$$

For proof of this equation, it is necessary just to interchange the places of $\beta$ and $\gamma$ in the
expression obtained for $I(\beta\gamma, \alpha)$ and use the analogous expression for $I(\alpha\gamma, \beta)$. By carrying out
this procedure, we obtain the same expression on the right- and left-hand sides of our formula:

$$I(\beta\gamma, \alpha) + I(\beta, \gamma) = I(\gamma, \alpha) + I_\gamma(\beta, \alpha) + I(\beta, \gamma),$$

and

$$I(\alpha\gamma, \beta) + I(\alpha, \gamma) = I(\gamma, \beta) + I_\gamma(\alpha, \beta) + I(\alpha, \gamma).$$
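
Both the relation for the information of a compound experiment and the triple information equation can be checked on an arbitrary joint distribution of three experiments. The following is a minimal sketch: the randomly generated table, the seed and all helper names are assumptions made for illustration, and every quantity is expressed through joint entropies.

```python
import random
from math import log2
from itertools import product

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axes):
    """Marginal distribution of the chosen coordinates of a table {(a, b, c): p}."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def H(joint, axes):
    return entropy(marginal(joint, axes))

def info(joint, x, y):
    """I(x, y) = H(x) + H(y) - H(xy); x and y are tuples of coordinate indices."""
    return H(joint, x) + H(joint, y) - H(joint, x + y)

def cond_info(joint, x, y, given):
    """I_given(x, y) = H_given(y) - H_{given x}(y), written with joint entropies."""
    return (H(joint, given + y) - H(joint, given)) - (
        H(joint, given + x + y) - H(joint, given + x))

random.seed(0)  # arbitrary seed; the table below is purely hypothetical
weights = {o: random.random() for o in product(range(2), range(3), range(2))}
total = sum(weights.values())
joint = {o: w / total for o, w in weights.items()}

A, B, C = (0,), (1,), (2,)   # coordinates playing the roles of alpha, beta, gamma
chain_gap = info(joint, B + C, A) - (info(joint, B, A) + cond_info(joint, C, A, B))
triple_gap = (info(joint, B + C, A) + info(joint, B, C)) - (
    info(joint, A + C, B) + info(joint, A, C))
print(abs(chain_gap) < 1e-9, abs(triple_gap) < 1e-9)
```

Both differences vanish up to rounding error, whatever table is used, because each side reduces to the same combination of joint entropies.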


From the triple information equation we obtain directly the conclusion indicated above
about the amount of information with respect to $\gamma$ contained in experiment $\alpha$, when $\gamma$ is
independent of some other experiment $\beta$. In fact, the independence of $\beta$ and $\gamma$ implies that
$I(\beta, \gamma) = 0$; on the other hand, we know that $I(\alpha\gamma, \beta) \geq I(\alpha, \beta)$ always. By virtue of the triple
information equation it therefore follows, with $\beta$ and $\gamma$ independent, that

$$I(\alpha, \gamma) = I(\beta\gamma, \alpha) - I(\alpha\gamma, \beta) \leq I(\beta\gamma, \alpha) - I(\alpha, \beta) = I_\beta(\gamma, \alpha);$$

moreover, $I_\beta(\gamma, \alpha)$ is never greater than $H_\beta(\alpha)$. On the other hand, making use of the
‘symmetry’ property of information (i.e., the equality $I(\alpha, \beta) = I(\beta, \alpha)$), we can rewrite the triple
information equation as

$$I(\beta\gamma, \alpha) + I(\beta, \gamma) = I(\beta, \alpha\gamma) + I(\gamma, \alpha)$$

and replace the inequality $I(\alpha\gamma, \beta) \geq I(\alpha, \beta)$ by the inequality

$$I(\beta, \alpha\gamma) \geq I(\beta, \alpha).$$

This implies straightaway that, with $\beta$ and $\gamma$ independent (i.e., with $I(\beta, \gamma) = 0$),

$$I(\beta\gamma, \alpha) \geq I(\beta, \alpha) + I(\gamma, \alpha)$$

(see p. 90).

The inequality $I(\gamma, \alpha) \leq I(\beta, \alpha)$ with $I_\beta(\gamma, \alpha) = 0$ can also be obtained from the triple
information equation. The derivation can be made if we only replace $I(\alpha\gamma, \beta)$ in this equation
by $I(\gamma, \beta) + I_\gamma(\alpha, \beta)$ and note that in this case $I(\beta\gamma, \alpha) = I(\beta, \alpha)$, and that the information
always has the symmetry property. Then, we arrive at the relation

$$I(\beta, \alpha) = I(\gamma, \alpha) + I_\gamma(\alpha, \beta),$$

proving that our inequality holds. It is also observed that this inequality becomes an equality
if and only if $I_\gamma(\alpha, \beta) = 0$. In this case $I(\gamma, \alpha) = I(\beta\gamma, \alpha)$, i.e., the compound experiment $\beta\gamma$
contains no additional information with respect to $\alpha$ in comparison with $\gamma$, a situation we had
noted earlier also.

Finally, we recall the fact that the inequality $I(\beta\gamma, \alpha) \geq I(\beta, \alpha)$ (which says that ‘the
information contained in a compound experiment $\beta\gamma$ about any experiment $\alpha$ is not less than that
contained in a simple experiment $\beta$’) can be associated in a sense with the entropy inequality
$H(\beta\gamma) \geq H(\beta)$ (‘the amount of uncertainty of a joint experiment $\beta\gamma$ is never less than that of
a simple experiment $\beta$’). However, in the entropy case, there is also one more estimate of the
quantity $H(\beta\gamma)$ in a different direction: $H(\beta\gamma) \leq H(\beta) + H(\gamma)$ (‘the amount of uncertainty of a
compound experiment $\beta\gamma$ is never greater than the sum of the uncertainties of the individual
experiments $\beta$ and $\gamma$’). In the case of information, the position is rather different: knowing
the amount of information about an experiment $\alpha$ that is contained in two experiments $\beta$ and
$\gamma$, we cannot estimate from above the information concerning $\alpha$ that is contained in a
compound experiment $\beta\gamma$. Thus, in the case analyzed on pp. 90-91 (where the experiments
$\alpha$, $\beta$ and $\gamma$ consist of determining that the digits 1, 2 and 3, respectively, appear on the side on
which the tetrahedron of Fig. 2 falls), we would have

$$I(\beta, \alpha) = I(\gamma, \alpha) = 0, \ \text{but} \ I(\beta\gamma, \alpha) = 1 \ (= H(\alpha)).$$

Hence, from the fact that $I(\beta, \alpha)$ and $I(\gamma, \alpha)$ are both small, it is impossible to infer that
$I(\beta\gamma, \alpha)$ is small, too.


2.4. Entropy (revisited). The determination of entropy from its properties 


The central theme of this chapter is the concept of entropy as a measure of the uncertainty
of an experiment $\alpha$ having random outcomes. The motivation of Section 2.1 was to explain
how the conventional definition of entropy is ‘natural’; however, the corresponding arguments
were only of a leading nature. The real justification for such a definition of the measure
of uncertainty is provided by the whole string of theorems proved in this chapter and Chapter
4, as well as in Appendix I. We shall now recall the definition of entropy and show that it
necessarily stems from the elementary requirements naturally imposed on a quantity that is
called upon to serve as the quantitative measure of the amount of uncertainty.

It is natural to assume that the measure of the amount of uncertainty $H(\alpha)$ (which we call
entropy) of an experiment $\alpha$ with the accompanying probability table:

Outcomes of experiment    $A_1$       $A_2$       ...    $A_k$
Probabilities             $p(A_1)$    $p(A_2)$    ...    $p(A_k)$

must depend only on the variables $p(A_1), p(A_2), \ldots, p(A_k)$ (i.e., it is a function of these
variables). We denote here the probabilities $p(A_1), p(A_2), \ldots, p(A_k)$ by $p_1, p_2, \ldots, p_k$ and the
entropy $H(\alpha)$ by $H(p_1, p_2, \ldots, p_k)$ (see p. 50).

We now formulate those conditions that are naturally required to be satisfied by the function
$H(p_1, p_2, \ldots, p_k)$. In the first place, it is plain that this function must not depend on
the order of the numbers $p_1, p_2, \ldots, p_k$; in fact, a change in the order of these numbers (i.e.,
a change in the columns of the probability table) is not associated with any change whatsoever
in the experiment $\alpha$ itself. Thus, the first condition says that

E.1. The value of the function $H(p_1, p_2, \ldots, p_k)$ remains invariant under any rearrangement
of the numbers $p_1, p_2, \ldots, p_k$.

The second condition is also equally natural:

E.2. The function $H(p_1, p_2, \ldots, p_k)$ is continuous, i.e., it varies by a small amount for small
variations in the probabilities $p_1, p_2, \ldots, p_k$.

In fact, a small change in the probabilities must evidently correspond to a small change
in the amount of uncertainty of the experiment.


The third condition we now introduce is slightly more complex. In order to have an
insight into what it consists of, we first presume that the experiment $\alpha$ has in all three
outcomes $A_1$, $A_2$, $A_3$, i.e., its probability table has the form

Outcomes of the experiment    $A_1$    $A_2$    $A_3$
Probabilities                 $p_1$    $p_2$    $p_3$

The measure of uncertainty $H(\alpha)$ of this experiment equals $H(p_1, p_2, p_3)$. The uncertainty
arises in this case because of the fact that we do not know specifically which of the three
outcomes of $\alpha$ will occur. We shall now clarify in two stages which of these outcomes of $\alpha$ actually
occurs. First, we determine whether one of the first two outcomes $A_1$ and $A_2$ or the
last outcome $A_3$ has occurred; this means that our experiment $\alpha$ is replaced by a new
experiment $\beta$ with the probability table

Outcomes of experiment    $B$            $A_3$
Probabilities             $p_1 + p_2$    $p_3$

The measure of uncertainty of this new experiment is obviously $H(\beta) = H(p_1 + p_2, p_3)$. It is
clear that the uncertainty measure of $\alpha$ must be greater than that of $\beta$; this is connected with
the fact that knowledge of the outcome of $\beta$ does not yet completely determine the outcome
of $\alpha$, since even after the outcome of $\beta$ is revealed there may still remain some uncertainty in
the outcome of $\alpha$.

It is not difficult to answer the question as to exactly how much greater the uncertainty
measure of $\alpha$ must be than that of $\beta$. Let us repeat the experiment $\alpha$ many times and each
time reveal at the beginning whether experiment $\beta$ had the outcome $B$ or $A_3$. It is then clear
that in certain cases, those in which $\alpha$ has the outcome $A_3$, this revelation completely solves
the problem of the outcome of $\alpha$, too. In other cases, those in which $\alpha$ has outcome $A_1$ or $A_2$,
after having ascertained the outcome of $\beta$, it is appropriate to determine precisely which of
these two outcomes of $\alpha$ occurs, which is equivalent to determining the outcome of a new
experiment $\beta'$ with the probability table

Outcomes of experiment    $A_1$                $A_2$
Probabilities             $p_1/(p_1 + p_2)$    $p_2/(p_1 + p_2)$

The measure of uncertainty of this experiment $\beta'$ is obviously
$H(\beta') = H\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)$. But since the probability (i.e., the average frequency) of a case in which, after
the realization of $\beta$, it is further necessary to determine the outcome of $\beta'$, is equal to $p_1 + p_2$,
it is natural to assume that the measure of uncertainty $H(\alpha)$ of $\alpha$ must exceed the measure of
uncertainty $H(\beta)$ of $\beta$ by the quantity $(p_1 + p_2)\, H(\beta')$, i.e., that the equation

$$H(p_1, p_2, p_3) = H(p_1 + p_2, p_3) + (p_1 + p_2)\, H\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)$$

must be satisfied. The same considerations applied to an experiment $\alpha$ with the probability
table


Outcomes of experiment    $A_1$    $A_2$    $A_3$    ...    $A_k$
Probabilities             $p_1$    $p_2$    $p_3$    ...    $p_k$

lead to the following third property of the function $H(p_1, p_2, \ldots, p_k)$:

E.3. The function $H(p_1, p_2, \ldots, p_k)$ satisfies the relation

$$H(p_1, p_2, \ldots, p_k) = H(p_1 + p_2, p_3, \ldots, p_k) + (p_1 + p_2)\, H\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right). \qquad (1)$$

This relation signifies that the uncertainty $H(\beta)$ of $\beta$, with the probability table

Outcomes of experiment    $B$            $A_3$    ...    $A_k$
Probabilities             $p_1 + p_2$    $p_3$    ...    $p_k$

obtained by the identification of the first two outcomes of the experiment $\alpha$, equals the
uncertainty $H(\alpha)$ of $\alpha$ minus the measure of uncertainty of the experiment $\beta'$ multiplied by $p_1 + p_2$.
This seems quite natural since the experiment $\beta'$ consists precisely of determining specifically
which of the first two outcomes of $\alpha$ will occur, if one of these two outcomes is known to
occur.

It can be shown that conditions E.1 through E.3 completely determine the form of the
function $H(p_1, p_2, \ldots, p_k)$: the only function satisfying all these conditions has the form†

$$H(p_1, p_2, \ldots, p_k) = c\,(-p_1 \log p_1 - p_2 \log p_2 - \ldots - p_k \log p_k). \qquad (*)$$

However, the proof of this fact is not quite straightforward; it was first given by Faddeev [45].
Later, it was also shown that condition E.2 can indeed be considerably weakened. For
example, it can be replaced by the condition E.2a: the function $H(p, 1 - p)$ is continuous at
the point $p = 0$ (i.e., $H(p, 1 - p) \to H(0, 1)$ as $p \to 0$), or the condition E.2b: the function
$H(p, 1 - p)$ does not change sign and is bounded on the interval $0 < p < 1$; if either of the
conditions E.2a or E.2b is valid, then formula (*) also follows uniquely from conditions E.1 and
E.3. Some other admissible versions of weakening condition E.2 and an extensive list of
relevant references can be found, for example, in Daróczy [43]; see also Aczél, Forte and Ng [41].
However, we shall not further stretch our treatment to the utmost generality. Following
Shannon [21], we shall not only regard all three conditions E.1 through E.3 to be valid but we shall
also supplement them with one more condition, whose validity can, in principle, be proved
by using conditions E.1 through E.3, but which is postulated here for the sake of considerably
simplifying our reasonings.
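
That formula (*) does satisfy relation (1) of condition E.3, as well as the symmetry condition E.1, can be confirmed numerically. The sketch below takes $c = 1$ and logarithms to base 2; these choices, the probability table and the function names are illustrative assumptions only.

```python
from math import log2

def H(*p):
    """Formula (*) with c = 1 and logarithms to base 2 (an illustrative choice)."""
    return -sum(x * log2(x) for x in p if x > 0)

def merged(p):
    """Right-hand side of relation (1): merge the first two outcomes, then resolve them."""
    s = p[0] + p[1]
    return H(s, *p[2:]) + s * H(p[0] / s, p[1] / s)

p = (0.5, 0.25, 0.125, 0.125)   # hypothetical probability table
direct, via_merge = H(*p), merged(p)
print(direct, via_merge)
```

The two printed numbers agree up to rounding error. The check shows only that (*) is consistent with E.3; the uniqueness claim is the content of the proof that follows in the text.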

In the sequel, an important role is played by the function $H(1/k, 1/k, \ldots, 1/k)$, the measure
of uncertainty of an experiment $\alpha_k$ having $k$ equally probable outcomes. It is obvious that, by
virtue of the fact that all outcomes of $\alpha_k$ are equally probable, the amount of uncertainty $H(\alpha_k)$
depends only on the number of outcomes $k$, i.e., it is a function of a single argument $k$:
$H(1/k, 1/k, \ldots, 1/k) = f(k)$. It is also clear that the amount of uncertainty of $\alpha_k$ must be
larger, the larger is the number $k$ of these outcomes. Thus, we can assert that

E.4. The function $H(1/k, 1/k, \ldots, 1/k) = f(k)$ increases with $k$.


We now show that the function $H(p_1, p_2, \ldots, p_k)$ satisfying conditions E.1 through E.4 necessarily
has the form (*) (where $c$ is some positive number). For this, we must slightly generalize
equation (1), whose validity is guaranteed by condition E.3. We first show that

$$H(p_1, \ldots, p_k) = H(p_1 + \ldots + p_i,\, p_{i+1}, \ldots, p_k) + (p_1 + \ldots + p_i)\, H\!\left(\frac{p_1}{p_1 + \ldots + p_i}, \frac{p_2}{p_1 + \ldots + p_i}, \ldots, \frac{p_i}{p_1 + \ldots + p_i}\right), \qquad i < k.$$

(The meaning of this equation is obviously similar to that of the original relation (1) with the
only difference being that here the $i$ outcomes $A_1, A_2, \ldots, A_i$ of experiment $\alpha$ are combined
together as the sole outcome $B$ of experiment $\beta$.) When $i = 2$ this equation coincides with
(1) and is, consequently, valid by virtue of condition E.3. We now assume its validity to be
proved already for some value $i$; in such a case, by making use of its validity also for $i = 2$,
we have

$$\begin{aligned}
H(p_1, p_2, \ldots, p_k) = {} & H(p_1 + \ldots + p_i,\, p_{i+1}, \ldots, p_k) + (p_1 + \ldots + p_i)\, H\!\left(\frac{p_1}{p_1 + \ldots + p_i}, \ldots, \frac{p_i}{p_1 + \ldots + p_i}\right) \\
= {} & \left\{ H(p_1 + \ldots + p_i + p_{i+1},\, p_{i+2}, \ldots, p_k) + (p_1 + \ldots + p_{i+1})\, H\!\left(\frac{p_1 + \ldots + p_i}{p_1 + \ldots + p_{i+1}}, \frac{p_{i+1}}{p_1 + \ldots + p_{i+1}}\right) \right\} \\
& + (p_1 + \ldots + p_i)\, H\!\left(\frac{p_1}{p_1 + \ldots + p_i}, \ldots, \frac{p_i}{p_1 + \ldots + p_i}\right).
\end{aligned}$$

On the other hand, since our equation is regarded to have been established for the value $i$, it
follows that

$$\begin{aligned}
H\!\left(\frac{p_1}{p_1 + \ldots + p_{i+1}}, \ldots, \frac{p_i}{p_1 + \ldots + p_{i+1}}, \frac{p_{i+1}}{p_1 + \ldots + p_{i+1}}\right)
= {} & H\!\left(\frac{p_1 + \ldots + p_i}{p_1 + \ldots + p_{i+1}}, \frac{p_{i+1}}{p_1 + \ldots + p_{i+1}}\right) \\
& + \frac{p_1 + \ldots + p_i}{p_1 + \ldots + p_{i+1}}\, H\!\left(\frac{p_1}{p_1 + \ldots + p_i}, \frac{p_2}{p_1 + \ldots + p_i}, \ldots, \frac{p_i}{p_1 + \ldots + p_i}\right).
\end{aligned}$$

This directly implies that the equation under consideration is valid for the value $i + 1$:

$$H(p_1, p_2, \ldots, p_k) = H(p_1 + p_2 + \ldots + p_{i+1},\, p_{i+2}, \ldots, p_k) + (p_1 + p_2 + \ldots + p_{i+1})\, H\!\left(\frac{p_1}{p_1 + \ldots + p_{i+1}}, \ldots, \frac{p_{i+1}}{p_1 + \ldots + p_{i+1}}\right).$$

By the principle of mathematical induction we can now be convinced of the fact that the
prescribed equation is satisfied for every $i$.

†If the coefficient $c$ in formula (*) is required to be positive, then it is necessary to specify also that the
function $H(p_1, p_2, \ldots, p_k)$ must be non-negative (of course, it suffices to include in the basic
conditions the requirement that one value, say, $H(\tfrac12, \tfrac12)$, be non-negative). We further note
that if the basic system of logarithms is not already fixed, then the multiplier $c$ can be discarded
in formula (*) (since $c \log_a p = \log_b p$, where $b = a^{1/c}$).

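Since the function $H(p_1, p_2, \ldots, p_k)$ does not depend on the order of its arguments $p_1,
p_2, \ldots, p_k$ (condition E.1), from what has been proved it also follows that

$$\begin{aligned}
H(p_1, p_2, \ldots, p_{i-1}, p_i, p_{i+1}, \ldots, p_j, p_{j+1}, \ldots, p_k)
= {} & H(p_1, p_2, \ldots, p_{i-1},\, p_i + p_{i+1} + \ldots + p_j,\, p_{j+1}, \ldots, p_k) \\
& + (p_i + p_{i+1} + \ldots + p_j)\, H\!\left(\frac{p_i}{p_i + \ldots + p_j}, \ldots, \frac{p_j}{p_i + \ldots + p_j}\right), \qquad 1 \leq i < j \leq k,
\end{aligned}$$

and, in general,

$$\begin{aligned}
H(p_1, \ldots, p_{i_1}, p_{i_1+1}, \ldots, p_{i_2}, p_{i_2+1}, \ldots, p_{i_3}, \ldots, p_{i_s+1}, \ldots, p_k)
= {} & H(p_1 + \ldots + p_{i_1},\; p_{i_1+1} + \ldots + p_{i_2},\; \ldots,\; p_{i_s+1} + \ldots + p_k) \\
& + (p_1 + \ldots + p_{i_1})\, H\!\left(\frac{p_1}{p_1 + \ldots + p_{i_1}}, \ldots, \frac{p_{i_1}}{p_1 + \ldots + p_{i_1}}\right) \\
& + (p_{i_1+1} + \ldots + p_{i_2})\, H\!\left(\frac{p_{i_1+1}}{p_{i_1+1} + \ldots + p_{i_2}}, \ldots, \frac{p_{i_2}}{p_{i_1+1} + \ldots + p_{i_2}}\right) \\
& + \ldots \\
& + (p_{i_s+1} + \ldots + p_k)\, H\!\left(\frac{p_{i_s+1}}{p_{i_s+1} + \ldots + p_k}, \ldots, \frac{p_k}{p_{i_s+1} + \ldots + p_k}\right),
\end{aligned}$$

$$1 \leq i_1 < i_2 < i_3 < \ldots < i_s < k. \qquad (2)$$

This equality, fairly complex in form, expresses in most general terms the addition law of
entropies enunciated in Sec. 2.2.†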

The extension (2) of equation (1) will be needed by us not in its own right but in its
application to the function $f(k)$. We assume that $k = lm$, where $l$ and $m$ are some integers, and
that the $k = lm$ probabilities $p_1, p_2, \ldots, p_k$, entering formula (2), are all equal to each other
(and, consequently, equal to $1/lm$). In such a case, the left-hand side of this formula is equal
to $f(lm)$. We further assume that the groups $(p_1, \ldots, p_{i_1})$, $(p_{i_1+1}, \ldots, p_{i_2})$, $\ldots$,
$(p_{i_s+1}, \ldots, p_k)$, appearing in the equality (2), consist each of $l$ numbers; in such a case the number
of such groups is $m$. In addition, we have

$$p_1 + \ldots + p_{i_1} = p_{i_1+1} + \ldots + p_{i_2} = \ldots = p_{i_s+1} + \ldots + p_k = l \cdot \frac{1}{lm} = \frac{1}{m}.$$

Hence, the first line in the right-hand side of equality (2) reduces to $H(1/m, 1/m, \ldots, 1/m) =
f(m)$. Concerning the remaining members on the right-hand side of (2), the number of these
members is equal to $m$ and they are all given by

$$\frac{1}{m}\, H\!\left(\frac{1/lm}{1/m}, \ldots, \frac{1/lm}{1/m}\right) = \frac{1}{m}\, H\!\left(\frac{1}{l}, \ldots, \frac{1}{l}\right) = \frac{1}{m}\, f(l).$$

Thus, in the case considered, equation (2) assumes the simple form

$$f(lm) = f(m) + m \times \frac{1}{m}\, f(l) = f(m) + f(l). \qquad (2a)$$


From (2a) it follows in particular that

$$f(k^2) = f(k \times k) = f(k) + f(k) = 2f(k),$$
$$f(k^3) = f(k^2 \times k) = f(k^2) + f(k) = 3f(k),$$
$$f(k^4) = f(k^3 \times k) = 4f(k),$$

and, in general, that

$$f(k^n) = n f(k). \qquad (2b)$$


We know that relation (2a) holds for the function $f(k) = c \log k$. It is also routine to
show that $c \log k$ is the only function that satisfies relation (2a) and condition E.4. In fact,
suppose that $k$ and $l$ are two arbitrary integers greater than 1. Choose some other large integer $N$
and find a number $n$ such that

$$l^n \leq k^N < l^{n+1}.$$

By E.4,

$$f(l^n) \leq f(k^N) < f(l^{n+1}),$$

or, by virtue of (2b),

$$n f(l) \leq N f(k) < (n + 1) f(l).$$

This implies that

$$\frac{n}{N} \leq \frac{f(k)}{f(l)} < \frac{n + 1}{N}.$$

We now note that from $l^n \leq k^N < l^{n+1}$ it follows that

$$n \log l \leq N \log k < (n + 1) \log l,$$

or

$$\frac{n}{N} \leq \frac{\log k}{\log l} < \frac{n + 1}{N}.$$

Thus, the ratios $f(k)/f(l)$ and $\log k/\log l$ lie within one and the same narrow bounds and,
consequently, must be close to each other:

$$\left| \frac{f(k)}{f(l)} - \frac{\log k}{\log l} \right| \leq \frac{1}{N}.$$

But since the last inequality holds for every value $N$, it follows that

$$\frac{f(k)}{f(l)} = \frac{\log k}{\log l},$$

or

$$\frac{f(k)}{\log k} = \frac{f(l)}{\log l}.$$

This relation holds for any two numbers $k$ and $l$; consequently,

$$\frac{f(k)}{\log k} = \frac{f(l)}{\log l} = c,$$

where $c$ does not depend on $k$ and $l$, and hence,

$$f(k) = c \log k.$$

But since $f(k)$ is an increasing function, we have $c > 0$.

†It is trivial to be convinced that, if $i_1 = i$, $i_2 = 2i$, $i_3 = 3i, \ldots, k = (s + 1)i$ and the
variables $p_1, p_2, \ldots, p_{i_1}$; $p_{i_1+1}, p_{i_1+2}, \ldots, p_{i_2}$; $\ldots$ are the probabilities of outcomes $A_1B_1,
A_1B_2, \ldots, A_1B_i$; $A_2B_1, A_2B_2, \ldots, A_2B_i$; $\ldots$ of a compound experiment $\alpha\beta$ (the sums
$p_1 + p_2 + \ldots + p_{i_1}$, $p_{i_1+1} + p_{i_1+2} + \ldots + p_{i_2}$, $\ldots$ are equal in such a case to the
probabilities of outcomes $A_1, A_2, \ldots$ of an experiment $\alpha$), then the equation (2) turns into the addition
law of entropies.
We now suppose that $p_1, p_2, \dots, p_k$ are arbitrary fractions
$$p_1 = \frac{q_1}{p}, \quad p_2 = \frac{q_2}{p}, \quad \dots, \quad p_k = \frac{q_k}{p}$$
($q_1, \dots, q_k$, $p$ being integers and $p$ being the common denominator of all these fractions),
such that all of them are less than unity and $p_1 + p_2 + \dots + p_k = 1$. According to formula
(2) (p. 97), we have


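$$\begin{aligned}
f(p) &= H\Bigl(\underbrace{\frac{1}{p}, \frac{1}{p}, \dots, \frac{1}{p}}_{p\ \text{times}}\Bigr) \\
&= H\Bigl(\underbrace{\frac{1}{p}, \dots, \frac{1}{p}}_{q_1\ \text{times}},\ \underbrace{\frac{1}{p}, \dots, \frac{1}{p}}_{q_2\ \text{times}},\ \dots,\ \underbrace{\frac{1}{p}, \dots, \frac{1}{p}}_{q_k\ \text{times}}\Bigr) \\
&= H\Bigl(\frac{q_1}{p}, \frac{q_2}{p}, \dots, \frac{q_k}{p}\Bigr) + \frac{q_1}{p}\, H\Bigl(\underbrace{\frac{1}{q_1}, \dots, \frac{1}{q_1}}_{q_1\ \text{times}}\Bigr) \\
&\quad + \frac{q_2}{p}\, H\Bigl(\underbrace{\frac{1}{q_2}, \dots, \frac{1}{q_2}}_{q_2\ \text{times}}\Bigr) + \dots + \frac{q_k}{p}\, H\Bigl(\underbrace{\frac{1}{q_k}, \dots, \frac{1}{q_k}}_{q_k\ \text{times}}\Bigr) \\
&= H(p_1, p_2, \dots, p_k) + p_1 f(q_1) + p_2 f(q_2) + \dots + p_k f(q_k).
\end{aligned}$$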


This implies that
$$\begin{aligned}
H(p_1, p_2, \dots, p_k) &= f(p) - p_1 f(q_1) - p_2 f(q_2) - \dots - p_k f(q_k) \\
&= (p_1 + p_2 + \dots + p_k) f(p) - p_1 f(q_1) - p_2 f(q_2) - \dots - p_k f(q_k) \\
&= p_1 (f(p) - f(q_1)) + p_2 (f(p) - f(q_2)) + \dots + p_k (f(p) - f(q_k)).
\end{aligned}$$
But since
$$f(p) - f(q_1) = c \log p - c \log q_1 = -c \log \frac{q_1}{p} = -c \log p_1,$$
$$f(p) - f(q_2) = -c \log p_2, \quad \dots, \quad f(p) - f(q_k) = -c \log p_k,$$
we finally obtain
$$H(p_1, p_2, \dots, p_k) = c(-p_1 \log p_1 - p_2 \log p_2 - \dots - p_k \log p_k).$$


The last equality has so far been proved only for rational values $p_1, p_2, \dots, p_k$. But by the
continuity of the function $H(p_1, p_2, \dots, p_k)$ it follows that it is true for every
$p_1, p_2, \dots, p_k$. This completes our proof.
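The two identities that carry the proof can be checked numerically. The following is only a sketch under stated assumptions: it takes $c = 1$ with base-2 logarithms, and the helper names `entropy` and `f` as well as the sample values $q = (2, 3, 5)$ are illustrative choices, not part of the original argument.

```python
import math
from fractions import Fraction

def entropy(probs):
    """Entropy -sum p*log2(p) in bits, i.e. the final formula with c = 1."""
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)

def f(k):
    """Entropy of k equally probable outcomes: f(k) = H(1/k, ..., 1/k)."""
    return entropy([Fraction(1, k)] * k)

# Relation (2a): f(lm) = f(m) + f(l), here with l = 6 and m = 5.
assert math.isclose(f(6 * 5), f(5) + f(6))

# Grouping step for rational probabilities p_i = q_i / p (illustrative q).
q = [2, 3, 5]
p = sum(q)
probs = [Fraction(q_i, p) for q_i in q]
left = f(p)
right = entropy(probs) + sum(float(p_i) * f(q_i) for p_i, q_i in zip(probs, q))
assert math.isclose(left, right)
```

Both assertions hold only up to floating-point rounding, which is why `math.isclose` is used instead of exact equality.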


3


The Solution of Certain Logical Problems 
by Calculating Information 


3.1. Simple examples 


In order to illustrate the practical versatility of the concepts and propositions 
of Chapter 2, we analyze here some amusing problems of the sort collected by 
Kordemskii [59]. In Sections 1 and 2 we shall formulate some specific examples 
of such problems and here we shall frequently use heuristic arguments based 
on the intuitive notion of information. A deeper and more rigorous discussion 
of the reasonings in these sections will be postponed to the concluding Section 3 
of this chapter. 

We start with the well-known logical problem concerning a ‘town of liars and 
a town of non-liars,’ which is quite popular among mathematics enthusiasts in 
high schools. 


Problem 23. Suppose we know that the inhabitants of a certain town A always 
tell the truth, while those of a neighbouring town B always lie. An observer O 
knows that he is in one of these two towns but does not know specifically which 
one. By interrogating a person he encounters O must determine the town he is in, 
or the town in which his collocutor resides (residents of A can visit B and vice 
versa), or both facts together. What is then the least number of questions O must 
ask (the collocutor is to answer only ‘yes’ or ‘no’ to all questions asked by O) ? 

Suppose that O must determine the town he is in. Here the experiment $\beta$,
whose result is of interest to him, can have two outcomes (this experiment
consists of finding out in which of the two towns, A or B, the observer O is). If we
assume that O has no information beforehand as to which of the two towns he
is in, then these two outcomes should be considered as equally possible and,
consequently, the entropy $H(\beta)$ of $\beta$ (i.e., the ‘total’ amount of information contained
in the outcome of this experiment) equals one bit. Furthermore, the experiment
$\alpha$, in which O puts one question to the collocutor, can also have two outcomes
(the latter may answer ‘yes’ or ‘no’); hence the entropy $H(\alpha)$ of this experiment
(i.e., the ‘total’ amount of information contained in the answer to the question
asked) is at most equal to one bit. The question that now arises is whether
experiment $\alpha$ can be so set up that the information $I(\alpha, \beta)$ contained in $\alpha$ about
experiment $\beta$ equals the entropy $H(\beta) = 1$ of $\beta$, i.e., that the outcome of $\alpha$
completely determines the outcome of $\beta$. Let us now recall that the sole relationship
between the information $I(\alpha, \beta)$ and entropy $H(\alpha)$ consists of the fact that
$$I(\alpha, \beta) \le H(\alpha) \quad (\text{since } I(\alpha, \beta) = H(\alpha) - H_\beta(\alpha)).$$


Since $H(\alpha)$ can equal 1, we can expect in general that, subject to a successful
choice of experiment (i.e., the question) $\alpha$, the equality
$$I(\alpha, \beta) = H(\beta)$$
may hold. For this, the only requirements are that the question pertaining to
experiment $\alpha$ be such that an affirmative or negative answer to it is equally
probable† (it is only in this case that the equality $H(\alpha) = 1 = H(\beta)$ holds), and
that the outcome of experiment $\beta$ determines that of $\alpha$ (it is only subject to this
condition that the equality $I(\alpha, \beta) = H(\alpha)$, or $H_\beta(\alpha) = 0$, holds, indicating that
the question pertaining to experiment $\alpha$ is ‘directed straight’ to ascertaining the
outcome of $\beta$ and an answer to this question contains no ‘extraneous’
information). All these restrictions are satisfied by the question ‘Do you live in this
town?’, which completely solves the problem. (A positive answer to this question
can be given only in town A and a negative answer only in town B.)
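The claim in parentheses can be confirmed by listing all four combinations of the observer's town and the collocutor's home town. This is a minimal sketch; the function name `answer` and the encoding of ‘yes’ as `True` are assumptions made for illustration.

```python
from itertools import product

def answer(resident_town, truth):
    """Reply of a resident: residents of A tell the truth, residents of B lie."""
    return truth if resident_town == "A" else not truth

replies = {}
for observer_town, resident_town in product("AB", repeat=2):
    # Question put by O: "Do you live in this town?"
    truth = (resident_town == observer_town)
    replies[(observer_town, resident_town)] = answer(resident_town, truth)

# 'yes' (True) is heard only in town A, 'no' (False) only in town B,
# whatever the home town of the collocutor.
for (observer_town, _), said_yes in replies.items():
    assert said_yes == (observer_town == "A")
```

The answer depends only on where O is, so it carries exactly the one bit that O needs and nothing else.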

It is quite obvious that O can ascertain the town in which his collocutor
resides by asking a single question: for this it suffices to put any question whose
answer is known to O beforehand (say, ‘Am I in a town?’ or ‘Does 2 × 2 make
four?’).

If, however, O has to know both the town he is in and the town in which his
collocutor resides, then he is called upon to determine the outcome of the joint
experiment $\beta_1\beta_2$, where $\beta_1$ consists of determining where O is and $\beta_2$ the place of
residence of his collocutor. The entropy $H(\beta_1\beta_2)$ of this experiment is greater
than the entropy $H(\beta_1)$ of $\beta_1$: $H(\beta_1\beta_2) = H(\beta_1) + H_{\beta_1}(\beta_2)$ (see Sec. 2 of
Chapter 2). In other words, in this case the information required is greater than 1
bit (recall that $H(\beta_1) = 1$). Since the entropy $H(\alpha)$ of an experiment $\alpha$ (which
consists of asking a question) with two outcomes cannot exceed 1, a single
experiment $\alpha$ does not provide an opportunity to obtain information equal to
$H(\beta_1\beta_2)$, i.e., does not enable us to determine completely the outcome of $\beta_1\beta_2$
(except for the completely uninteresting case in which the conditional entropy
$H_{\beta_1}(\beta_2)$ is 0, i.e., in which the outcome of $\beta_1$ determines the outcome of $\beta_2$;
such is the situation when the residents of A cannot enter B and conversely).


†Subject to the condition that it is equally probable for O to be in either A or B and
for his collocutor to reside in either A or B.




Thus, an estimate of the amount of information yields a rigorous proof of the
fact that a single question (no matter how it is put!) does not enable us to
determine directly both the town in which O is and the town in which his
collocutor resides. If, however, O puts two questions (i.e., carries out a joint
experiment $\alpha_1\alpha_2$, having four possible outcomes), then he can indeed ascertain the
outcome of experiment $\beta_1\beta_2$ (the outcome of $\beta_1$ can be determined with the aid
of the question pertaining to experiment $\alpha_1$, and that of $\beta_2$ by the question
pertaining to experiment $\alpha_2$).

In the next problem, some of the hypotheses of Problem 23 bear a more 
complex character. 


Problem 24. Suppose that there are three towns A, B and C. The inhabitants
of A always tell the truth, those of B only tell lies and those of C alternately tell
the truth and lie. An observer O desires to find out the town in which he is and
the town in which a person he encounters resides. How many questions need he
put to his collocutor if all the questions are to have only a ‘yes’ or ‘no’ answer?

Here we must determine which of the nine possible outcomes of experiment
$\beta$ is realized (O may be in any one of the three towns A, B and C and,
independent of this, his collocutor may reside in any one of the same three towns).
If we assume that O has no prior information about experiment $\beta$, then all
these nine outcomes can be considered to be equally probable and the entropy
$H(\beta)$ of $\beta$ (and, consequently, also the amount of information obtained by
ascertaining the outcome of $\beta$) equals $\log 9$. Suppose that the joint experiment
$A_k = \alpha_1\alpha_2 \dots \alpha_k$ consists of having O ask $k$ questions. Since he may receive
an affirmative or a negative answer to each question, the entropy of each
experiment $\alpha_1, \alpha_2, \dots, \alpha_k$ does not exceed one bit. On the other hand,
$$H(\alpha_1\alpha_2) = H(\alpha_1) + H_{\alpha_1}(\alpha_2) \le H(\alpha_1) + H(\alpha_2)$$
(because $H_{\alpha_1}(\alpha_2) \le H(\alpha_2)$) and similarly,
$$H(A_k) = H(\alpha_1\alpha_2 \dots \alpha_k) \le H(\alpha_1) + H(\alpha_2) + \dots + H(\alpha_k) \le k$$
(a rigorous proof of this inequality is easy to obtain by mathematical induction).
This can be verbalized differently as follows: If the answer to each question
yields us information not exceeding one bit, then by asking $k$ questions we can
obtain information not greater than $k$ bits. Hence if $k = 3$, then the information
given to us is less than $\log 9$ (it can at most equal $3 = \log 8 < \log 9$) and,
thus, three questions will not ensure that we can always determine both the
place in which O is and the place in which his collocutor resides. However, four
adroitly put questions may possibly do the trick (because it can then be asserted
only that $H(A_4) \le 4 = \log 16$). Indeed, it is easy to see that the following four
questions do assure the revelation of all that is of interest to O:
(i) Do I happen to be in one of the towns A and B?
(ii) Do I happen to be in town C?
(iii) Do you reside in town C?
(iv) Do I happen to be in town A?


In fact, ‘yes’ or ‘no’ answers to both the questions (i) and (ii) immediately indicate that the
collocutor of O resides in C. Suppose, for instance, that the answers to both of these questions
are in the affirmative (the case in which both answers are negative is analyzed similarly). In
this case, a negative (obviously incorrect) answer to question (iii) implies that the answer to
question (ii) is correct and eliminates the necessity of further asking question (iv); a positive
(correct) answer to question (iii) means that the answer to question (i) is trustworthy and in
order to find out the town in which O is, it is necessary to put question (iv) (the answer to
which is known to be incorrect). An affirmative answer to (i) and a negative answer to (ii), as
well as the converse position, indicate that the collocutor of O resides in A or B. In such a
case, a negative (i.e., correct) answer to question (iii) shows that the respondent resides in A
and question (iv) is needed only if the answer to (ii) is negative; a positive (incorrect) answer
to question (iii) means that the collocutor of O resides in B and question (iv) is found
necessary only if the answer to (ii) is positive.
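The case analysis above can also be verified by brute force. The sketch below makes two assumptions that the text leaves implicit: all four questions are always asked in the order (i) to (iv), even when (iv) is not strictly needed, and a resident of C may begin either with a true or with a false answer. The names `replies` and `decode` are illustrative.

```python
from itertools import product

TOWNS = "ABC"

def replies(observer, resident, starts_truthful):
    """Replies (True = 'yes') to questions (i)-(iv), asked in that order."""
    truths = [
        observer in "AB",   # (i)   Do I happen to be in one of the towns A and B?
        observer == "C",    # (ii)  Do I happen to be in town C?
        resident == "C",    # (iii) Do you reside in town C?
        observer == "A",    # (iv)  Do I happen to be in town A?
    ]
    out = []
    for n, truth in enumerate(truths):
        if resident == "A":
            honest = True                       # always tells the truth
        elif resident == "B":
            honest = False                      # always lies
        else:                                   # town C: alternates
            honest = starts_truthful if n % 2 == 0 else not starts_truthful
        out.append(truth if honest else not truth)
    return tuple(out)

# Every sequence of four replies must point to one (observer, resident) pair.
decode = {}
for observer, resident, starts in product(TOWNS, TOWNS, (True, False)):
    key = replies(observer, resident, starts)
    assert decode.setdefault(key, (observer, resident)) == (observer, resident)
```

No sequence of replies is shared by two different outcomes, so the four questions distinguish all nine cases, in agreement with the bound $\log 9 \le 4$.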


The following is one more example of a similar sort of problem (see Problem
283 in [59]).


Problem 25. How many questions are necessary to determine a positive integer
thought of by a collocutor assuming that the conceived integer does not exceed 10
(or 100, or 1000 or an arbitrary positive integer) and only ‘yes’ or ‘no’ can be
given as answers to all questions?

Suppose we know that the number thought of does not exceed 10. In such a
case, an experiment $\beta$, consisting of the determination of this number, can have
10 different outcomes. Until the first question is put and answered, we can
consider all these outcomes to be equally probable so that the entropy $H(\beta)$ of
$\beta$ (i.e., the requisite information) equals $\log 10 \approx 3.32$ bits. We consider a joint
experiment $A_k = \alpha_1\alpha_2 \dots \alpha_k$, in which $k$ questions are asked. The entropy of
$\alpha_1$, where $\alpha_1$ consists of asking a single question, does not exceed 1 bit since $\alpha_1$
can have only two outcomes (positive and negative answers to the question);
hence the entropy of $A_k$ does not exceed $k$ bits (see p. 103). On the other hand,
the information $I(A_k, \beta)$ concerning experiment $\beta$ that is contained in the joint
experiment $A_k$ cannot exceed the total information contained in the outcome
of $A_k$, i.e., the entropy $H(A_k)$. In order that the outcome of $A_k$ completely
determine the outcome of $\beta$, it is necessary that the equality $I(A_k, \beta) = H(\beta)$
hold. Hence, we conclude in this case that
$$\log 10 = H(\beta) = I(A_k, \beta) \le H(A_k) \le k,$$
$$k \ge \log 10 \approx 3.32,$$
and since $k$ is an integer,
$$k \ge 4.$$


Let us now show that by asking just four questions the outcome of $\beta$ can indeed
be completely determined, i.e., we can find the number $x$ that was thought of.
It is easy to visualize the procedure to follow for this purpose. In the first place,
it is natural to put the first question in such a way that the information contained
in the answer to it, that is, the entropy $H(\alpha_1)$, is the maximum possible, in
other words, that it actually equals one bit. For this, it is necessary
that both outcomes of our experiment $\alpha_1$ be equally probable. The further
requirement is that the information $I(\alpha_1, \beta)$ about $\beta$, contained in $\alpha_1$, be equal
to, rather than less than, the entropy $H(\alpha_1)$ of $\alpha_1$. This demands that the answer to
the first question contain no ‘extraneous’ information, i.e., that the conditional
entropy $H_\beta(\alpha_1)$ be zero (in other words, the outcome of $\alpha_1$ is fully determined
by the outcome of $\beta$). These considerations clearly prescribe how the first question
ought to be put. We partition the set of all possible values of $x$ (i.e., the
set of positive integers from 1 to 10) into two numerically equal parts (since the
two outcomes of $\alpha_1$ must be equally probable) and then ask to which of these
two parts $x$ belongs. Thus, we may ask, say, if $x$ is greater than 5. In this case,
obviously,


$$I(\alpha_1, \beta) = H(\beta) - H_{\alpha_1}(\beta) = 1,$$
i.e.,
$$H_{\alpha_1}(\beta) = p(A_1) H_{A_1}(\beta) + p(A_2) H_{A_2}(\beta) = H(\beta) - 1$$
($A_1$ and $A_2$ are the two outcomes of $\alpha_1$; $p(A_1) = p(A_2) = \frac{1}{2}$); in addition,
$$H_{A_1}(\beta) = H_{A_2}(\beta) = H(\beta) - 1,$$
so that for every outcome of $\alpha_1$, the entropy of the experiment $\beta$ we are
interested in decreases by 1 bit. Furthermore, in exactly the same manner, we divide
the new set of permissible values of $x$ into two equal (or, at least, approximately
equal) parts, and determine to which of them $x$ belongs (if $x$ is greater than 5,
then ask whether this number is larger than 7; if, however, $x$ does not exceed 5,
then question whether $x$ is larger than 3), and so on. Each time, by partitioning
the set of admissible values of $x$ into two parts, as numerically equal as possible,
we can certainly determine $x$ by asking only four questions.
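The halving procedure can be sketched in a few lines. The function name `count_questions` is hypothetical, and the split point chosen here (the midpoint, rounded down) is one of several equally good ways of dividing the admissible values into nearly equal parts; it need not reproduce the particular thresholds 7 and 3 mentioned above.

```python
def count_questions(secret, n):
    """Find `secret` among 1..n by halving; return (value found, questions asked)."""
    low, high = 1, n
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if secret > mid:        # reply to "Is x greater than mid?"
            low = mid + 1
        else:
            high = mid
    return low, questions

# Worst case over all numbers from 1 to 10.
worst = max(count_questions(x, 10)[1] for x in range(1, 11))
print(worst)  # prints 4
```

For 10 admissible values the first question is "Is x greater than 5?", as in the text, and no number requires more than four questions.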

In exactly the same way, we can show that the least number $k$ of questions
enabling us to determine an unknown number $x$, which may have 100 or 1000
values, is given by the inequality $k \ge \log 100 \approx 6.64$ and, correspondingly, by
$k \ge \log 1000 \approx 9.97$. Since in all such cases $k$ is an integer, this implies that
$$k \ge 7, \quad \text{or (correspondingly)} \quad k \ge 10.$$


In general, the least number $k$ of questions enabling one to find an unknown
number $x$ having one of $n$ admissible values is given by the inequalities
$$k - 1 < \log n \le k \quad (\text{or } 2^{k-1} < n \le 2^k). \tag{1}$$
We note also that
$$k \ge \log n$$
in all the cases; moreover, $k = \log n$ if and only if the number $n$ is an integral
power of 2 and, consequently, $\log n$ is an integer. However, when $n$ is very large,
the difference between the numbers $k$ and $\log n$ is found to be quite small in
comparison to these numbers themselves (because, for large $n$, the quantity $\log n$
is also large and the difference $k - \log n$ never exceeds unity). Thus,
we can assume that for large $n$, the ratio of $\log n$ (the entropy of $\beta$ under
consideration) to the information (1 bit) about $\beta$ contained in experiment $\alpha$ (which
consists of finding the answer to a single question), quite precisely indicates the
number $k$ of experiments that are involved to determine the outcome of $\beta$.
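Inequalities (1) are easy to tabulate. The sketch below computes the least $k$ with exact integer arithmetic (the helper name `least_questions` and the sample values of $n$ are illustrative) and prints the ratio $k / \log n$, which approaches 1 as $n$ grows.

```python
import math

def least_questions(n):
    """Least k with 2**k >= n, using integers only (no rounding issues)."""
    return (n - 1).bit_length()

for n in (10, 100, 1000, 4950, 10**6):
    k = least_questions(n)
    assert 2 ** (k - 1) < n <= 2 ** k          # inequalities (1)
    print(n, k, round(math.log2(n), 2), round(k / math.log2(n), 3))
```

Using `bit_length` rather than `math.ceil(math.log2(n))` is a deliberate choice here: it avoids floating-point error when $n$ is an exact power of 2.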

At first sight, Problem 25 appears to be as artificial as its two predecessors; we
shall see later, however, that it has serious engineering applications.†† A more
detailed discussion of the solution of this problem (including also a more general
formulation of its conditions) is deferred to Sec. 3 of this chapter.

The next problem is quite similar to Problem 25. 


Problem 26. A person thinks of two (distinct) numbers not exceeding 100. How 
many questions are necessary to find these numbers if each question can have only 
‘yes’ or ‘no’ answers ? 


†Obviously, after we find that the number $x$ has one of $m$ values, where $m$ is odd (say,
$m = 5$), we cannot secure strictly equally probable outcomes of the succeeding experiment
$\alpha_{i+1}$, because $m$ possible values of $x$ are impossible to split into numerically equal parts. Hence
the entropy $H(\alpha_{i+1})$ of experiment $\alpha_{i+1}$ will be less than 1. This implies that our questioning
will not be most profitable from the viewpoint of information obtained, i.e., that by using the
same number of questions, we can find an unknown number even when the set of its possible
values is larger (thus, using four questions we can find an unknown number which is
not just one of 10 but even one of $2^4 = 16$ possible values).

††It should nevertheless be indicated that in spite of the recreational formulation of
Problems 23-24, a sufficiently serious meaning lies concealed in them (see pp. 121-123).

In this case, experiment $\beta$, whose outcome must be determined, can have
$\binom{100}{2} = 4950$ different outcomes. If, as usual, we consider all these outcomes
to be equally probable, then the entropy $H(\beta)$ of $\beta$ (i.e., the information that
we obtain after having determined the outcome of $\beta$) equals $\log 4950$. But,
since the information that can be provided by an answer to a single question
does not exceed one bit (because experiment $\alpha$, which consists of asking a single
question, can have but two outcomes, ‘yes’ or ‘no’), the least number of questions
we must ask to be able to always determine the outcome of $\beta$ can never be less
than $\log 4950 \approx 12.27$ (cf. the solution of Problem 25). Thus, if we ask fewer
than thirteen questions, it is certainly possible that we shall not succeed in
determining both of the unknown numbers.

It is also easy to see that thirteen adroitly put questions always enable us to
find the two numbers. In order to achieve this, it is necessary that the information
$I(\alpha, \beta)$ concerning the outcome of experiment $\beta$ contained in the
outcome of experiment $\alpha$ in which a single question is asked (more precisely, in
which each of the questions is asked), be as close to one bit as possible. Hence,
it is plain that the questions must be so put that the answers ‘yes’ and ‘no’
are equiprobable or nearly equiprobable. And for this purpose, it suffices that
to begin with, we partition the 4950 outcomes of $\beta$ into two numerically equal
parts (such that each part contains 2475 outcomes) and determine to which of
these parts the real outcome of $\beta$ belongs (i.e., we should ask in the first place
whether or not the two unknown numbers belong to the group containing the
first 2475 pairs of numbers). Next, in exactly the same way, it is necessary to
divide into two numerically equal parts (as far as possible) that group of outcomes
to which the outcome of our interest belongs, and then determine to
which of these two smaller parts the two unknown numbers belong, and so on.
It is clear that here we invariably determine the pair of unknown numbers with
the aid of not more than thirteen questions.

We further remark that the distinction between Problems 26 and 25 can be
considered to be purely verbal. It is clear that in solving Problem 25 a role is
played only by the total number $n$ of those numbers, one of which is the number
thought of. In addition, obviously, we can always consider these $n$ numbers to be
indexes of arbitrarily chosen objects, say, $n$ indexes of $n$ cars, or $n$ pairs of numbers,
or $n$ given arbitrary groups of numbers, and so on; this has no influence on the
solution of the problem. However, if we consider that $n$ in Problem 25 equals
4950 and that the considered 4950 numbers index the set of all possible pairs
of numbers, each of which does not exceed 100, then we arrive at Problem 26.
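The reduction of Problem 26 to Problem 25 can be made concrete by indexing the pairs and halving the range of indexes. In this sketch the ordering of the pairs (the lexicographic order produced by `itertools.combinations`) and the names `pairs`, `index_of` and `questions_for` are assumptions made for illustration; any fixed numbering of the 4950 pairs would serve equally well.

```python
from itertools import combinations

# Index every unordered pair of distinct numbers from 1 to 100.
pairs = list(combinations(range(1, 101), 2))
index_of = {pair: idx for idx, pair in enumerate(pairs)}

def questions_for(secret_pair):
    """Halve the range of pair indexes until a single index is left."""
    target = index_of[secret_pair]
    low, high, asked = 0, len(pairs) - 1, 0
    while low < high:
        mid = (low + high) // 2
        asked += 1
        if target > mid:        # "Is the pair beyond position mid in the list?"
            low = mid + 1
        else:
            high = mid
    assert pairs[low] == secret_pair
    return asked

worst = max(questions_for(pair) for pair in pairs)
print(len(pairs), worst)  # prints 4950 13
```

The first question separates the first 2475 pairs from the remaining 2475, exactly as described above.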

In exactly the same way, we can show that the minimum number of questions
we need ask to determine the $m$ conceived numbers, not exceeding $n$, equals the
least integer $k$ such that $k \ge \log \binom{n}{m}$. If, however, it is known, say, that either
one number not exceeding $n$ is thought of, or no number is thought of, then in
order to find out whether a number has been thought of and if so, precisely what
number, it is necessary to put not fewer than $\log (n + 1)$ and not more
than $\log (n + 1) + 1$ questions. In fact, in this case, the number of possible outcomes
of the corresponding experiment $\beta$ is $n + 1$ (the unity in this sum corresponds
to the case where no number is thought of). Finally, if we assume that not more
than $m$ numbers are thought of, where $m < n/2$, each of which does not exceed
$n$, then the number of questions necessary to determine how many and exactly
what numbers were thought of lies between


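$$\log \left[ \binom{n}{m} + \binom{n}{m-1} + \dots + \binom{n}{1} + 1 \right]$$
and
$$\log \left[ \binom{n}{m} + \binom{n}{m-1} + \dots + \binom{n}{1} + 1 \right] + 1.$$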


In fact, the experiment $\beta$ considered here can have
$\binom{n}{m} + \binom{n}{m-1} + \dots + \binom{n}{1} + 1$ different outcomes (because what was
thought of may turn out to be one of the $\binom{n}{m}$ groups of $m$ numbers, or one of the
$\binom{n}{m-1}$ groups of $m - 1$ numbers, ..., or one of $\binom{n}{1} = n$ individual numbers, or even
no number at all). Renumbering these
$N = \binom{n}{m} + \binom{n}{m-1} + \dots + \binom{n}{1} + 1$ outcomes of experiment $\beta$ as the
numbers from 1 to $N$, we arrive at Problem 25 (where the number $n$ has been replaced
by $N$). Later, we shall make further use of this remark.
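The count $N$ and the resulting number of questions can be computed directly. The following sketch uses the standard-library function `math.comb`; the helper names `outcomes` and `least_questions` and the sample values $n = 100$, $m = 2$ are illustrative.

```python
from math import comb

def outcomes(n, m):
    """Ways to think of at most m distinct numbers out of n, 'no number' included."""
    return sum(comb(n, j) for j in range(m + 1))

def least_questions(n_outcomes):
    """Least k with 2**k >= n_outcomes, as in Problem 25."""
    return (n_outcomes - 1).bit_length()

N = outcomes(100, 2)               # C(100,2) + C(100,1) + 1
print(N, least_questions(N))       # prints 5051 13
```

With $m = 1$ the same functions reproduce the case of "one number or none", for which $N = n + 1$.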


3.2. The counterfeit coin problem 


The starting point of this section is the following problem, closely allied to 
Problem 25. 


Problem 27. There are 25 coins of the same denomination. Of these, 24 are of 
identical weight and the one counterfeit coin is slightly lighter than the others. 
The question is how many weighings on a beam balance are necessary to find the 
counterfeit coin ? (See Problem 277, 1 and 2 in [59].) 

The experiment $\beta$ whose result must be determined has in this case 25
possible outcomes (any of the 25 coins may turn out to be counterfeit). It is natural
to suppose all these outcomes to be equally probable so that $H(\beta) = \log 25$. In
other words, the determination of a counterfeit coin is related in the given case
to obtaining the information measured by the number $\log 25$. The experiment
$\alpha_1$, consisting of one (arbitrary) weighing, can have three outcomes (the left or
the right beam may be lighter or both may be equal); hence, $H(\alpha_1) \le \log 3$ and
the information $I(\alpha_1, \beta)$ obtained from such an experiment does not exceed
$\log 3$. We now consider the joint experiment $A_k = \alpha_1\alpha_2 \dots \alpha_k$ consisting of $k$
consecutive weighings; it gives information not exceeding $k \log 3$ (see p. 103).
If experiment $A_k$ enables us to determine completely the outcome of the
experiment $\beta$, then we must have
$$H(A_k) \ge I(A_k, \beta) \ge H(\beta), \quad \text{or} \quad k \log 3 \ge \log 25.$$
Hence, we infer that $3^k \ge 25$, i.e.,
$$k \ge \log_3 25 = \frac{\log 25}{\log 3},$$
and since $k$ is an integer, we must have
$$k \ge 3.$$


It is easy to show that with the aid of three weighings the counterfeit coin
can be found. If we want to gain the maximum possible information from
experiment $\alpha_1$, it is necessary that the outcomes of this experiment be (as far as
possible) equally probable. Suppose that $m$ coins are placed on each beam
(clearly it makes no sense to put different numbers of coins on two beams: in
such a case the outcome of the corresponding experiment is known beforehand,
and the information obtained is 0); the number of coins not placed on the
balance is equal to $25 - 2m$. Since the probability that the counterfeit coin
will turn up in a given group of $m$ coins is $m/25$ (because all outcomes of
experiment $\beta$ are considered equally probable!), the three outcomes of experiment $\alpha_1$
have the probabilities $m/25$, $m/25$ and $(25 - 2m)/25$. These probabilities are
closest to one another when $m = 8$ and $25 - 2m = 9$. If 8 coins are placed
on each beam, the first weighing (experiment $\alpha_1$) allows us to select a second
group of 9 coins (if the beams are equal) or 8 coins (if one of the beams is
lighter), one of which is counterfeit. In both cases, in order to obtain the
maximum information from the second weighing (experiment $\alpha_2$) it is necessary to
place three coins from this group on each of the two beams of the balance; in
such case the joint experiment $\alpha_1\alpha_2$ permits us to select a group of 3 (or of 2)
coins, one of which is counterfeit. In the third weighing (experiment $\alpha_3$), we
place one of the remaining suspect coins on each of the two beams of the balance
and easily find the counterfeit coin.
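The weighing scheme just described can be simulated. This is a sketch under stated assumptions: coins are labelled 0 to 24, the balance is modelled by the hypothetical function `weigh`, and the number of coins per pan is taken as `(len(suspects) + 1) // 3`, which yields the splits 8/8/9, 3/3/3, 3/3/2 and 1/1/1 used in the text.

```python
def weigh(left, right, fake):
    """Beam balance: -1 if the left pan is lighter, 1 if the right pan is, 0 if equal."""
    if fake in left:
        return -1
    if fake in right:
        return 1
    return 0

def find_light_coin(coins, fake):
    """Return (coin found, weighings used) for a single lighter counterfeit coin."""
    suspects = list(coins)
    weighings = 0
    while len(suspects) > 1:
        m = (len(suspects) + 1) // 3            # coins placed on each pan
        left, right, rest = suspects[:m], suspects[m:2 * m], suspects[2 * m:]
        weighings += 1
        outcome = weigh(left, right, fake)
        suspects = left if outcome == -1 else right if outcome == 1 else rest
    return suspects[0], weighings

results = [find_light_coin(range(25), fake) for fake in range(25)]
print(max(w for _, w in results))  # prints 3
```

Whichever of the 25 coins is counterfeit, the simulation finds it in three weighings, matching the lower bound $k \ge 3$.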

In exactly the same way, we can show that the least number $k$ of weighings
that enable us to determine a single counterfeit (lighter!) coin contained in a group
of $n$ coins is given by the inequalities:
$$3^{k-1} < n \le 3^k, \quad \text{or} \quad k - 1 < \frac{\log n}{\log 3} \le k. \tag{2}$$
If $n$ is large, then this number $k$ is given, with a sufficient accuracy, by the ratio
$\log n/\log 3$, i.e., by the ratio of the entropy of experiment $\beta$, consisting of the
determination of the counterfeit coin, to the maximum information which can
be obtained in a single weighing (see p. 106).

In the following, we shall use a similar conclusion related to an even more
general setting of the problem. In the first place, it is clear that if we have
$n$ coins with one counterfeit among them and we know that the counterfeit coin
is slightly heavier than the others, then the least number of weighings $k$ on a
beam balance that enables us to detect this counterfeit coin is given by the
same inequalities (2): the substitution of a heavier coin for the
lighter one does not alter our arguments. We now consider a more general case
in which our $n$ coins are divided into two groups: group A, containing $a$ coins, and
group B, containing $b = n - a$ coins, it being known that one of these $n$ coins is
counterfeit and that if this coin belongs to group A (resp. B), then it is lighter
(resp. heavier) than the rest, and show that here also the least number of weighings
$k$ that enables us to find the counterfeit coin is given by inequalities (2).† For
$b = 0$, this statement reverts to the previous case.

In fact, it is clear that the experiment $\beta$ in which we are interested can
have $n$ different outcomes. Hence $3^k \ge n$; otherwise, the experiment
$A_k = \alpha_1\alpha_2 \dots \alpha_k$, consisting of $k$ subsequent weighings, can in no way uniquely determine
the outcome of $\beta$ (because in this case $I(A_k, \beta) \le H(A_k) \le k \log 3 = \log 3^k
< \log n = H(\beta)$; the outcomes of $\beta$ are considered, as usual, to be equally
probable). On the other hand, when $n \le 3^k$ the counterfeit coin can always be
separated out by $k$ weighings; this is easy to show by using, say, mathematical
induction. In fact, if $k = 1$, i.e., $n = 1$, 2 or 3, then our assertion is almost
obvious (with the one exception indicated in the preceding footnote): for $n = 1$
the counterfeit coin is known always, but for $n = 2$ (and $a = 2$ or $b = 2$) and
for $n = 3$, it suffices to compare the weights of two coins from one group in
order to determine the counterfeit one. We now suppose it to be already known
that for $n \le 3^k$ the counterfeit coin can always be separated out with the aid of
not more than $k$ weighings. Let $3^k < n \le 3^{k+1}$. It is easy to see that in this
case we can always select an even number $2x$ of coins from group A and an
even number $2y$ of coins from group B such that the numbers $x$ and $y$ satisfy
the conditions
$$2x + 2y \le 2 \times 3^k, \quad n - (2x + 2y) \le 3^k,$$
i.e.,
$$3^k \ge x + y \ge \frac{n - 3^k}{2}.$$
†This statement has one obvious exception: if $n = 2$, $a = b = 1$, then it is obviously quite
impossible to separate out the counterfeit coin.

We now place $x$ coins from group A and $y$ coins from group B on each beam.
Then, the number of coins not placed on the balance is $n_1 = n - 2x - 2y \le 3^k$.
If the beams balance for this weighing (experiment $\alpha_1$), then we infer that
the counterfeit coin is among the $n_1$ coins not on the balance, i.e., among
the $a_1 = a - 2x$ (resp. $b_1 = b - 2y$) coins from group A (resp. B) not involved
in the first weighing. If one of the beams is lighter, then one of the $x$ coins
from group A lying on the lighter pan or of the $y$ coins from group B lying on
the heavier pan is counterfeit. However, since $n_1 \le 3^k$ and $x + y \le 3^k$, by the
assumption made we are able to separate out the counterfeit coin in both the
cases by not more than $k$ weighings.† Consequently, from our $n \le 3^{k+1}$ coins,
we can certainly find the counterfeit coin by making not more than $k + 1$
weighings. This conclusion also completes the proof of the statement made
above.


We now consider the following problem, which is slightly more complicated. 


Problem 28. There are 12 coins of the same denomination, of which 11 have 
identical weight and the remaining one is counterfeit, having a weight different 
from all the rest (it being unknown whether it is lighter or heavier than the genuine 
ones). What is the least number of weighings on a beam balance that will enable 
us to find the counterfeit coin and determine whether it is lighter or heavier than 
the rest of the coins? Solve the same problem also for the case of 13 coins (see 
Problem 277 (3) in [59] or Problem 6(a) in [62]). 

We consider here an experiment $\beta$ having 24 or 26 possible outcomes (any
one among the existing 12 or 13 coins may be counterfeit, and this coin may be
either lighter or heavier than the genuine coins). If all these outcomes are
considered to be equally probable, the entropy $H(\beta)$ of $\beta$ equals $\log 24$ or $\log 26$. Thus
we are required to obtain $\log 24$, or correspondingly $\log 26$, units of information.
Since from the joint experiment $A_k = \alpha_1\alpha_2 \dots \alpha_k$, consisting of $k$ weighings,
we can obtain information not greater than $k \log 3 = \log 3^k$, and $3^3 = 27$,
at first sight it seems plausible that in the case of 12 or 13 coins, three
weighings will enable us to find the counterfeit coin and also to decide whether
it is lighter or heavier than others. In reality, however, in the case of 13 coins
three weighings may be found to be insufficient; this fact is quite simple to show
with the aid of a somewhat more careful evaluation of the information obtained
from the first weighing.

In fact, the first weighing may consist of placing 1, 2, 3, 4, 5 or 6 coins on
each beam. We denote the corresponding experiments by $\alpha_1^{(i)}$, where
$i$ can equal 1, 2, 3, 4, 5 or 6. If $i$ equals 1, 2, 3, or 4 and, as a
consequence of the first weighing, the beams remain balanced, then experiment
$\alpha_1^{(i)}$ indicates that one of the $13 - 2i$ coins not on the balance is
counterfeit. Since this number is not less than 5, 10 (or still more) different
outcomes remain possible after the first weighing. Therefore, the two succeeding
weighings may not guarantee the detection of the counterfeit coin and the
clarification of whether it is lighter or heavier than the rest (because
$2\log 3 = \log 9 < \log 10$). If $i$ equals 5 or 6 and, in experiment
$\alpha_1^{(i)}$, one beam (say, the right one) is heavier, then
$\alpha_1^{(i)}$ indicates that either one of the $i$ coins on the ‘right’ beam
is counterfeit and heavier, or one of the $i$ coins on the ‘left’ beam is
counterfeit and lighter. Thus, here also, we are still left with
$i + i = 2i \ge 10$ possible outcomes of experiment $\beta$, and again two
weighings are insufficient to ascertain which outcome actually holds.

† If $n > 2$, then the case in which $x = y = 1$, or $a_1 = b_1 = 1$, no longer
constitutes an exception. In fact, apart from one suspect coin from group A and
one from group B, we now have a certain number of coins that are known to be
genuine; by comparing the weight of one of them with that of one of the suspect
coins we shall be able to find the counterfeit coin in one weighing.
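
The counting in this paragraph can be checked mechanically. The following is a minimal sketch, not part of the original text; the function name `worst_case_remaining` is an illustrative assumption. For every first weighing of $i$ coins against $i$ coins it reports the largest number of outcomes of $\beta$ that can still remain, which must not exceed $3^2 = 9$ if two further weighings are to suffice.

```python
def worst_case_remaining(i, n_coins=13):
    """Largest number of outcomes of beta still possible after weighing i vs i."""
    # Beams balance: the counterfeit is off the balance, lighter or heavier.
    balanced = 2 * (n_coins - 2 * i)
    # One beam heavier: a heavy coin on that beam or a light coin on the other.
    tilted = 2 * i
    return max(balanced, tilted)

# Hypothetical helper table: worst case for each admissible first weighing.
worst = {i: worst_case_remaining(i) for i in range(1, 7)}
for i, remaining in worst.items():
    print(f"i = {i}: up to {remaining} outcomes remain (two weighings separate at most 9)")
```

Under these assumptions every choice of $i$ leaves at least 10 outcomes in the worst case, which is the reason three weighings cannot be guaranteed to suffice for 13 coins.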

We now pass on to the case of 12 coins. Suppose that, in the first weighing,
we place $i$ coins on each beam (experiment $\alpha_1^{(i)}$). If, in this, the
beams remain balanced (outcome E of experiment $\alpha_1^{(i)}$; we shall use
similar notation in what follows), then one of the $12 - 2i$ coins not on the
balance is counterfeit, which corresponds to $2(12 - 2i)$ equally probable
outcomes of the experiment $\beta$ under consideration (from the total number of
24 outcomes). If the right beam is heavier (outcome R), then either one of the
$i$ coins on the right beam is counterfeit and heavier, or one of the $i$ coins
on the left beam is counterfeit and lighter; these cases correspond to $2i$
outcomes of $\beta$. In exactly the same manner, the case in which the left beam
is heavier (outcome L) also corresponds to $2i$ outcomes of $\beta$. Thus, the
three outcomes of the experiment $\alpha_1^{(i)}$ have the probabilities

$$p(E) = \frac{2(12 - 2i)}{24} = \frac{6 - i}{6}, \qquad p(R) = \frac{2i}{24} = \frac{i}{12}, \qquad p(L) = \frac{2i}{24} = \frac{i}{12}.$$

Hence, it immediately follows that of the six experiments $\alpha_1^{(1)}$,
$\alpha_1^{(2)}$, ..., $\alpha_1^{(6)}$, the experiment $\alpha_1^{(4)}$, whose
three outcomes are equally probable, has the largest entropy. Thus, the
experiment $\alpha_1^{(4)}$ gives us the maximum information and it is most
expedient to start with it. We now consider the two cases separately.
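
As a numerical illustration of this comparison, here is a minimal sketch (the function name `first_weighing_entropy` is an assumption made for this example) that evaluates the entropy, in bits, of each of the six candidate first weighings for 12 coins from the probabilities $(6 - i)/6$, $i/12$, $i/12$.

```python
from fractions import Fraction
from math import log2

def first_weighing_entropy(i, n_coins=12):
    """Entropy in bits of weighing i coins against i coins out of n_coins."""
    total = 2 * n_coins                            # each coin: lighter or heavier
    p_e = Fraction(2 * (n_coins - 2 * i), total)   # beams balance
    p_r = Fraction(2 * i, total)                   # right beam heavier
    p_l = Fraction(2 * i, total)                   # left beam heavier
    return -sum(float(p) * log2(float(p)) for p in (p_e, p_r, p_l) if p > 0)

entropies = {i: first_weighing_entropy(i) for i in range(1, 7)}
best_i = max(entropies, key=entropies.get)
for i, h in entropies.items():
    print(f"i = {i}: H = {h:.3f} bits")
print("largest entropy at i =", best_i)
```

The maximum is reached at $i = 4$, where the three outcomes are equally probable and the entropy equals $\log 3$ (about 1.585 bits).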


Case I. The beams are balanced for the first weighing. In this case, one of
the four coins not on the balance is counterfeit. By means of two weighings
we must find out which of these coins is counterfeit and also ascertain whether
it is lighter or heavier than the others. Since we are left with
$2 \times 4 = 8$ possible outcomes of experiment $\beta$ and
$2\log 3 = \log 9 > \log 8$, we may expect that this is possible. However, if
just one of our four suspect coins is placed on each beam so that two coins are
not on the balance (experiment $\alpha_2^{(1)}$) and the beams remain equal,
then from the next weighing we must determine specifically which of the four
outcomes that still remain possible occurs. This is clearly impossible to do
(since $4 > 3$). If, however, we place on each beam a pair of our four suspect
coins (experiment $\alpha_2^{(2)}$) and one of the two beams is heavier, then we
are again left with four still possible outcomes of experiment $\beta$ and have
to resort again to at least two further weighings in order to determine
completely which of them occurs. This gives an impression that, in the case of
12 coins also, three weighings are insufficient to solve the problem.

This inference is, however, premature. In fact, we have not taken into account
that, after the first weighing, we have at our disposal $4 + 4 = 8$ definitely
genuine coins that can participate in the second weighing. Hence, we have
considerably more than two possible variants of experiment $\alpha_2$. Let us
denote by $\alpha_2^{(i,j)}$ an experiment in which we place on the right beam
$i$ of our four suspect coins and on the left beam $j \le i$ of them as well as
$i - j$ definitely genuine coins (obviously, it makes no sense to place the
genuine coins on both the beams). In such a case $\alpha_2^{(1,1)}$ and
$\alpha_2^{(2,2)}$ are those experiments $\alpha_2^{(1)}$ and $\alpha_2^{(2)}$
that we considered above. We denote by $p(R)$, $p(L)$ and $p(E)$, respectively,
the probabilities that in experiment $\alpha_2^{(i,j)}$ the right beam is
heavier, the left beam is heavier, or both are equal. These probabilities are
easy to calculate; they equal the ratio of the number of those outcomes of
$\beta$ for which $\alpha_2^{(i,j)}$ has outcome R (correspondingly, L or E) to
the total number of remaining possible outcomes of $\beta$ (this number is 8).
Since $i + j \le 4$, obviously all experiments $\alpha_2^{(i,j)}$ are easy to
enumerate; the values of the probabilities $p(R)$, $p(L)$ and $p(E)$
corresponding to them are listed in the table below. In the last column of the
table, we have given also the entropy (in bits) $H(\alpha_2^{(i,j)})$ of
experiments $\alpha_2^{(i,j)}$, which equals
$-p(E)\log p(E) - p(R)\log p(R) - p(L)\log p(L)$.

  i   j    p(E)   p(R)   p(L)    $H(\alpha_2^{(i,j)})$
  1   1    1/2    1/4    1/4     1.50
  1   0    3/4    1/8    1/8     1.06
  2   2    0      1/2    1/2     1.00
  2   1    1/4    3/8    3/8     1.56
  2   0    1/2    1/4    1/4     1.50
  3   1    0      1/2    1/2     1.00
  3   0    1/4    3/8    3/8     1.56
  4   0    0      1/2    1/2     1.00

From this table, we see that experiments $\alpha_2^{(2,1)}$ and
$\alpha_2^{(3,0)}$ have the largest entropy. Hence, in order to gain the maximum
information, it is necessary in the process of the second weighing either to put
two of the four suspect coins on one beam and one suspect coin and one
definitely genuine coin on the other beam, or to put three suspect coins on one
beam and three definitely genuine ones on the second beam. It is easy to see
that, in both cases, we can then completely determine, by the third weighing,
the outcome of $\beta$. Indeed, if experiment $\alpha_2^{(2,1)}$ or
$\alpha_2^{(3,0)}$ has outcome E, then the only suspect coin not on the balance
in the second weighing is counterfeit; in order to find out also whether it is
lighter or heavier than the others, it is necessary to compare its weight with
that of one of the 11 definitely genuine coins (this is the third weighing). If
experiment $\alpha_2^{(2,1)}$ has outcome R, then either one of the two coins on
the right beam is counterfeit and is heavier than the others, or the lone
suspect coin on the left beam is counterfeit and is lighter than the genuine
ones. Comparing the weights of the two coins on the right beam (by a third
weighing) we are able to know the outcome of $\beta$ (if these coins have the
same weight, then the third of the suspect coins is counterfeit; otherwise, the
one that weighed more is). If experiment $\alpha_2^{(3,0)}$ has outcome R, then
one of the three coins lying on the right beam is counterfeit and is heavier
than the genuine ones. Comparing the weights of two of these coins (by a third
weighing), we can find the outcome of $\beta$ (either the heavier one is
counterfeit, or, if they are equal, the third coin is counterfeit).

Similarly, we can also analyze the cases in which experiment $\alpha_2^{(2,1)}$
or $\alpha_2^{(3,0)}$ has outcome L.
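
The entropy table for the second weighing in Case I can be recomputed directly from the counting rule stated above. The following is a minimal sketch under the assumptions of the text (four suspect coins, eight equally probable outcomes); the names `second_weighing` and `table` are chosen only for this illustration.

```python
from fractions import Fraction
from math import log2

SUSPECTS = 4                # coins left off the balance in the first weighing
OUTCOMES = 2 * SUSPECTS     # each suspect may be lighter or heavier

def entropy(probs):
    """Entropy in bits of a finite probability distribution."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

def second_weighing(i, j):
    """Probabilities (p(E), p(R), p(L)) for i suspects right, j suspects left."""
    p_e = Fraction(2 * (SUSPECTS - i - j), OUTCOMES)  # counterfeit is off the balance
    p_r = Fraction(i + j, OUTCOMES)   # heavy coin on the right or light coin on the left
    p_l = Fraction(i + j, OUTCOMES)   # light coin on the right or heavy coin on the left
    return p_e, p_r, p_l

table = {}
for i in range(1, SUSPECTS + 1):
    for j in range(0, i + 1):
        if i + j <= SUSPECTS:
            table[(i, j)] = entropy(second_weighing(i, j))

h_max = max(table.values())
best = sorted(key for key, h in table.items() if abs(h - h_max) < 1e-12)
for (i, j), h in sorted(table.items()):
    print(i, j, *second_weighing(i, j), f"{h:.2f}")
print("largest entropy for (i, j) in", best)
```

The eight rows agree with the table, and the largest entropy (about 1.56 bits) is attained for $(i, j) = (2, 1)$ and $(3, 0)$.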


Case II. One beam of the balance (say, the right one) is heavier for the first
weighing. In this case, either one of the four coins on the right beam is
counterfeit and heavier than the others, or one of the four coins on the left
beam is counterfeit and lighter. In the second weighing, we can place on the
right beam $i_1$ coins from the right beam and $i_2$ coins from the left beam,
and on the left beam $j_1$ coins from the right beam, $j_2$ from the left beam
and $(i_1 + i_2) - (j_1 + j_2)$ definitely genuine coins not on the balance
during the first weighing (experiment $\alpha_2^{(i_1,i_2;\,j_1,j_2)}$; we
assume that $i_1 + i_2 \ge j_1 + j_2$). Here also a table of the entropies of
experiments $\alpha_2^{(i_1,i_2;\,j_1,j_2)}$ can be composed for all possible
values of $i_1$, $i_2$, $j_1$ and $j_2$; however, since the number of possible
variants is fairly large here, it is expedient that some of them be excluded
from the very start.

We note that, since the information we expect to gain from the third weighing
(experiment $\alpha_3$) about the outcome of $\beta$ does not exceed $\log 3$
(because $H(\alpha_3) \le \log 3$), after two weighings we must be left with at
most three possible outcomes of experiment $\beta$; otherwise, experiment
$\alpha_3$ will not allow us to determine uniquely the outcome of $\beta$.
Hence, it is necessary in the first place that the number of suspect coins not
on the balance in the second weighing does not exceed 3 (since in the case of
outcome E of experiment $\alpha_2$, it is precisely these coins that remain
suspect). Thus, we have

$$8 - (i_1 + i_2 + j_1 + j_2) \le 3, \quad \text{i.e.,} \quad i_1 + i_2 + j_1 + j_2 \ge 5,$$

or, since $i_1 + i_2 \ge j_1 + j_2$,

$$i_1 + i_2 \ge 3, \qquad j_1 + j_2 \ge 5 - (i_1 + i_2).$$

Furthermore, if experiment $\alpha_2^{(i_1,i_2;\,j_1,j_2)}$ has outcome R, then
either one of the $i_1$ ‘right’ coins on the right beam is counterfeit and
heavier, or one of the $j_2$ ‘left’ coins on the left beam is counterfeit and
lighter. In exactly the same way, in the case of outcome L, one may suspect that
the counterfeit coin is one of the $i_2$ ‘left’ coins on the right beam, or one
of the $j_1$ ‘right’ coins on the left beam. Hence, we also obtain the following
two inequalities

$$i_1 + j_2 \le 3 \quad \text{and} \quad i_2 + j_1 \le 3,$$

which must, of course, be satisfied. Finally, it is clear that the inequalities

$$i_1 + j_1 \le 4, \qquad i_2 + j_2 \le 4 \quad \text{and} \quad (i_1 + i_2) - (j_1 + j_2) \le 4$$

must also be satisfied.
We now list in the accompanying table all cases satisfying our conditions.

  i_1  i_2  j_1  j_2    p(E)   p(R)   p(L)    $H(\alpha_2^{(i_1,i_2;\,j_1,j_2)})$
   2    1    2    1     1/4    3/8    3/8     1.56
   2    1    2    0     3/8    1/4    3/8     1.56
   2    1    1    1     3/8    3/8    1/4     1.56
   1    2    1    2     1/4    3/8    3/8     1.56
   1    2    0    2     3/8    3/8    1/4     1.56
   1    2    1    1     3/8    1/4    3/8     1.56
   3    1    1    0     3/8    3/8    1/4     1.56
   1    3    0    1     3/8    1/4    3/8     1.56
   2    2    1    1     1/4    3/8    3/8     1.56
   2    2    1    0     3/8    1/4    3/8     1.56
   2    2    0    1     3/8    3/8    1/4     1.56
   3    2    1    0     1/4    3/8    3/8     1.56
   2    3    0    1     1/4    3/8    3/8     1.56
Thus we see that here we have not two, as in the preceding case, but as many as
13 variants of experiment $\alpha_2$, each of which contains one and the same
maximum information about experiment $\beta$ (it is perfectly clear that here
the information $I(\alpha_2, \beta)$ equals the entropy $H(\alpha_2)$). For any
choice of experiment $\alpha_2$, this information is found adequate to allow us
to determine completely the outcome of $\beta$ with the aid of one more, that
is, the third weighing. Thus, for instance, in the case of outcome E of the
experiment $\alpha_2^{(2,1;\,2,1)}$ one of the two ‘left’ coins not on the
balance in the second weighing is counterfeit. Moreover, we also know that this
coin is lighter than the genuine ones; hence, to determine which coin is
counterfeit, it suffices to compare the weights of these two coins (or compare
one of them with a definitely genuine coin). In the case of outcome R of the
same experiment either one of the two ‘right’ coins on the right beam is
counterfeit and heavier, or the only ‘left’ coin on the left beam is counterfeit
and lighter. Hence, it is sufficient to compare the weights of the two suspect
‘right’ coins. The case in which experiment $\alpha_2^{(2,1;\,2,1)}$ has outcome
L can be analyzed in exactly the same manner.
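
The list of admissible second weighings in Case II can be reproduced by brute force from the inequalities derived above. The following is a minimal sketch; the function name `admissible` and the tuple order $(i_1, i_2, j_1, j_2)$ are conventions adopted only for this illustration.

```python
from itertools import product

def admissible(i1, i2, j1, j2):
    """Check the inequalities imposed on the second weighing in Case II."""
    on_balance = i1 + i2 + j1 + j2
    return (
        i1 + j1 <= 4 and i2 + j2 <= 4       # only four 'right' and four 'left' coins exist
        and i1 + i2 >= j1 + j2              # convention: left beam is padded with genuine coins
        and (i1 + i2) - (j1 + j2) <= 4      # only four genuine coins are available
        and 8 - on_balance <= 3             # outcome E leaves at most three suspects
        and i1 + j2 <= 3                    # outcome R leaves at most three suspects
        and i2 + j1 <= 3                    # outcome L leaves at most three suspects
    )

variants = [v for v in product(range(5), repeat=4) if admissible(*v)]
for i1, i2, j1, j2 in variants:
    counts = (8 - (i1 + i2 + j1 + j2), i1 + j2, i2 + j1)   # suspects after E, R, L
    print((i1, i2, j1, j2), "suspects after E, R, L:", counts)
print(len(variants), "admissible variants")
```

Each admissible variant splits the eight remaining outcomes into groups of sizes 2, 3 and 3 in some order, which is why all thirteen rows of the table have the same entropy of about 1.56 bits.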

This completes the case of 12 coins. We may now recall the case of 13 coins 
and show that four weighings are sufficient in this case (we have shown earlier 
that three weighings cannot suffice here). We place four coins on each beam so 
that five coins are not on the balance. If one of the two beams is heavier, we 
have the same situation as encountered by us while analyzing the case of out- 
come R of the first weighing in the 12 coin problem (with the immaterial differ- 
ence that we now have not four but five definitely genuine coins). Hence, in 
this case, three weighings are sufficient to find the counterfeit coin and ascertain 
whether it is lighter or heavier than the others. If, however, the beams are 
balanced, then we have to single out the counterfeit coin from not four but from 
five suspects. Here we may begin by comparing the weight of any one of the 
suspect coins with that of a definitely genuine coin: if their weights are different, 
then our problem is immediately solved, otherwise, we are back to the case of 
four suspect coins, and then, with the aid of two weighings, we can determine 
the counterfeit coin and ascertain whether it is lighter or heavier than the
others (see Case I above).

The next problem now generalizes the conditions of Problem 28. 


Problem 29. There are n coins of the same denomination, of which one is
counterfeit and is either lighter or heavier than the rest. What is the least number
k of weighings on a beam balance that is necessary to find the counterfeit coin and
ascertain whether it is lighter or heavier than the others? (See [62], Problem 65.)

This problem is related to the examination of experiment $\beta$ which may have
$2n$ outcomes. It is natural to consider all these outcomes to be equally
probable; hence, the entropy $H(\beta)$ equals $\log 2n$. Moreover, the entropy
of the experiment $A_k = \alpha_1\alpha_2\ldots\alpha_k$ consisting of $k$
successive weighings does not exceed $k\log 3 = \log 3^k$; hence, we must have

$$2n \le 3^k, \quad \text{that is,} \quad n \le \frac{3^k}{2},$$

or, since $n$ and $k$ are positive integers and $3^k$ is odd,

$$n \le \frac{3^k - 1}{2}.$$

In other words,

$$k \ge \log_3(2n + 1) = \frac{\log(2n + 1)}{\log 3}.$$

Thus, say, if $n > (3^3 - 1)/2 = 13$, then the counterfeit coin cannot be found
with three weighings or fewer.

It is also easy to see that, even in the case in which $n = (3^k - 1)/2$, $k$
weighings do not always enable us to find the counterfeit coin and ascertain
whether it is lighter or heavier than the others. (For example, when $n = 13$,
the counterfeit coin may not be found in all cases from three weighings.) The
proof of this is quite similar to the one given above for the particular case
$n = 13$ and $k = 3$ (see the start of the solution of Problem 28). Indeed, for
evaluating the entropy of experiment $A_k = \alpha_1\alpha_2\ldots\alpha_k$ we
have so far proceeded from the fact that the entropy of each individual weighing
can equal $\log 3$; in the present case, however, because $n = (3^k - 1)/2$ is
not divisible by 3, even the entropy of the first weighing (experiment
$\alpha_1$) cannot attain this value (since the three outcomes of the first
weighing can in no way be equally probable). Since
$n - 1 = 3(3^{k-1} - 1)/2$ is divisible by 3, it is clear that in the first
weighing it is most advantageous to place the group

$$\frac{n - 1}{3} = \frac{3^{k-1} - 1}{2}$$

of coins on each beam, leaving the remaining group

$$\frac{n + 2}{3} = \frac{3^{k-1} + 1}{2}$$

of coins not on the balance: in this case the probabilities of the three
outcomes of experiment $\alpha_1$, equal to

$$\frac{(n - 1)/3}{n} = \frac{1}{3} - \frac{1}{3n}, \qquad \frac{(n - 1)/3}{n} = \frac{1}{3} - \frac{1}{3n} \qquad \text{and} \qquad \frac{(n + 2)/3}{n} = \frac{1}{3} + \frac{2}{3n},$$

are closest to each other and, hence, the entropy $H(\alpha_1)$ of the
corresponding experiment is greater than that in any other case.
However, it is plain that the amount of uncertainty that remains after this is
such that it cannot be eliminated completely by $k - 1$ weighings. The simplest
way to demonstrate this is as follows: we assume that in the first weighing the
beams balance. In such a case, one of the group of
$(n + 2)/3 = (3^{k-1} + 1)/2$ coins not on the balance is counterfeit, so that
we are still left with $3^{k-1} + 1$ equally probable outcomes of $\beta$ (any
of the $(3^{k-1} + 1)/2$ coins not on the balance may turn out to be counterfeit
and this coin may be either lighter or heavier than the genuine ones). After
ascertaining which of these possibilities eventually occurs, we obtain the
amount of information equal to $\log(3^{k-1} + 1)$. This amount of information
exceeds the maximum information $\log 3^{k-1} = (k - 1)\log 3$ which can be
obtained by $k - 1$ weighings. Similarly, we can show that for any other choice
of experiment $\alpha_1$ (the first weighing) this experiment can have an
outcome for which the remaining $k - 1$ weighings will be insufficient for a
unique determination of the outcome of $\beta$.
Thus, we see that if

$$n \ge \frac{3^k - 1}{2},$$

then $k$ weighings may be insufficient. We now show that, if
$n < (3^k - 1)/2$ (i.e., if $n \le (3^k - 3)/2$; in other words, if
$k \ge \log_3(2n + 3) = \log(2n + 3)/\log 3$), then $k$ weighings do suffice.†
This conclusion completes the solution of our problem.
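
The two bounds obtained so far can be compared numerically. The following is a minimal sketch, with function names (`min_weighings`, `information_bound`) that are assumptions of this illustration; integer arithmetic is used instead of logarithms to avoid rounding problems, and the two exceptional cases of one and two coins are excluded.

```python
def min_weighings(n):
    """Least k with n <= (3**k - 3) / 2, i.e. 3**k >= 2n + 3 (assumes n >= 3)."""
    if n < 3:
        raise ValueError("the formula has exceptions for n = 1 and n = 2")
    k = 1
    while 3 ** k < 2 * n + 3:
        k += 1
    return k

def information_bound(n):
    """Least k with 3**k >= 2n + 1: the bound from entropy counting alone."""
    k = 0
    while 3 ** k < 2 * n + 1:
        k += 1
    return k

for n in (3, 12, 13, 39, 40):
    print(n, information_bound(n), min_weighings(n))
```

For most values of $n$ the two numbers coincide; they differ exactly when $n = (3^k - 1)/2$, for instance at $n = 13$, where entropy counting alone would allow three weighings but four are in fact required.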

† This statement has two obvious exceptions: if $n = 1$, then it is impossible
to ascertain whether the counterfeit coin is lighter or heavier than the genuine
ones (of which in this case there are none); if $n = 2$, then it is impossible
to find the counterfeit coin.

We begin with the following auxiliary problem: suppose that in addition to
$n$ coins, of which one is counterfeit, we have at least one definitely genuine
coin; it is required to find the counterfeit coin and ascertain whether it is
lighter or heavier than the rest. In this case, as before, we can state that if
$n > (3^k - 1)/2$, then $k$ weighings are insufficient (because obviously the
amount of uncertainty of the initial experiment does not change because of the
addition of a genuine coin). However, we cannot now be certain that, even when
$n = (3^k - 1)/2$, the $k$ weighings must be insufficient. In fact, by taking
into account the additional genuine coin, we can attain a greater closeness than
before among the probabilities of the three possible outcomes of the first
weighing and, consequently, gain from this weighing a greater amount of
information. For this purpose, it is necessary only to place on each beam a
group of $(n + 2)/3 = (3^{k-1} + 1)/2$ coins (one of the $3^{k-1} + 1$ coins
used is the additional genuine coin), leaving the remaining
$(n - 1)/3 = (3^{k-1} - 1)/2$ suspect coins not on the balance. In this case, it
is easy to see that the probabilities of the individual outcomes of the first
weighing are given by

$$\frac{\frac{n+2}{3} + \left(\frac{n+2}{3} - 1\right)}{2n} = \frac{1}{3} + \frac{1}{6n}, \qquad \frac{\frac{n+2}{3} + \left(\frac{n+2}{3} - 1\right)}{2n} = \frac{1}{3} + \frac{1}{6n}$$

and

$$\frac{(n - 1)/3}{n} = \frac{1}{3} - \frac{1}{3n},$$

i.e., they are indeed slightly closer to each other than before; consequently,
the entropy $H(\alpha_1)$ of experiment $\alpha_1$ is also slightly larger here.
This none-too-large difference, however, suffices to assure the possibility that
from $k$ weighings we can find the counterfeit coin and ascertain whether it is
lighter or heavier than the others.
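
To see how small the gain in entropy from the extra genuine coin actually is, one can evaluate both sets of probabilities for $n = (3^k - 1)/2$. The following is a minimal sketch; the function name `first_weighing_probs` and its boolean parameter are assumptions of this illustration.

```python
from fractions import Fraction
from math import log2

def entropy(probs):
    """Entropy in bits of a finite probability distribution."""
    return -sum(float(p) * log2(float(p)) for p in probs if p > 0)

def first_weighing_probs(k, genuine_available):
    """Outcome probabilities (tilt, tilt, balance) of the best first weighing, n = (3**k - 1)/2."""
    n = (3 ** k - 1) // 2
    if genuine_available:
        tilt = Fraction(2 * n + 1, 6 * n)      # 1/3 + 1/(6n)
        balanced = Fraction(n - 1, 3 * n)      # 1/3 - 1/(3n)
    else:
        tilt = Fraction(n - 1, 3 * n)          # 1/3 - 1/(3n)
        balanced = Fraction(n + 2, 3 * n)      # 1/3 + 2/(3n)
    return tilt, tilt, balanced

for k in (2, 3, 4):
    without = entropy(first_weighing_probs(k, genuine_available=False))
    with_coin = entropy(first_weighing_probs(k, genuine_available=True))
    print(f"k = {k}: H without genuine coin = {without:.4f}, with = {with_coin:.4f}")
```

In every case the entropy with the genuine coin is larger, yet still below $\log 3 \approx 1.585$ bits, in agreement with the remark that the difference is small.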

To prove that, with even one definitely genuine coin at our disposal, we can get
along with $k$ weighings when $n \le (3^k - 1)/2$, it is convenient to make use
of mathematical induction. The statement is completely obvious for $k = 1$
(i.e., for $n = 1$). We now assume this to have already been proved for a
certain value $k$ and consider the case when
$(3^k - 1)/2 < n \le (3^{k+1} - 1)/2$. If we prove that $k + 1$ weighings are
sufficient in this case, then from this stems the validity of our statement in
all cases. Let us put in the first weighing on one beam some number $x$ of our
coins and on the other beam $x - 1$ of the $n$ coins plus the lone definitely
genuine coin; the number of coins not on the scale is then
$n_1 = n - (2x - 1)$. The number $x$ is so chosen that we have

$$2x - 1 \le 3^k \quad \text{and} \quad n - (2x - 1) \le \frac{3^k - 1}{2},$$

i.e.,

$$3^k \ge 2x - 1 \ge n - \frac{3^k - 1}{2};$$

it is clear that this can be accomplished when $n \le (3^{k+1} - 1)/2$ (because
$n - (3^k - 1)/2 \le (3^{k+1} - 1)/2 - (3^k - 1)/2 = 3^k$). If in the first
weighing the beams balance, then what remains for us is only to find the
counterfeit coin from the number $n_1 \le (3^k - 1)/2$ of the coins not on the
balance. Since we have at our disposal definitely genuine coins too, this can be
accomplished (by the inductive assumption) with $k$ weighings. If, however, one
of the beams is heavier in the first weighing, then we are left with
$2x - 1 \le 3^k$ suspect coins. Moreover, in this case we know that if one out
of some $a$ coins is counterfeit, then it is lighter than the others, but
heavier if it is one of the remaining $b$ coins, where $a + b \le 3^k$ (if the
first beam is heavier, then $a = x - 1$, $b = x$; if the second beam is heavier,
then $a = x$, $b = x - 1$). In this case also from $k$ successive weighings we
can always find the counterfeit coin (see pp. 110-111).
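
The claim that a suitable $x$ always exists in the inductive step can be checked for small $k$. The following is a minimal sketch; the function name `choose_x` is hypothetical, and the extra requirement $2x - 1 \le n$ (one cannot place more suspect coins than there are) is an assumption added here, not a condition stated explicitly in the text.

```python
def choose_x(n, k):
    """Return x with 2x - 1 <= 3**k and n - (2x - 1) <= (3**k - 1) / 2, or None."""
    lower = n - (3 ** k - 1) // 2      # smallest admissible value of 2x - 1
    upper = min(3 ** k, n)             # assumption: at most n suspect coins can be placed
    for odd in range(max(lower, 1), upper + 1):
        if odd % 2 == 1:               # 2x - 1 must be odd
            return (odd + 1) // 2
    return None

# Check every n in the range ((3**k - 1)/2, (3**(k+1) - 1)/2] for small k.
for k in range(1, 6):
    low = (3 ** k - 1) // 2 + 1
    high = (3 ** (k + 1) - 1) // 2
    assert all(choose_x(n, k) is not None for n in range(low, high + 1))
print("a suitable x exists for every n in the inductive range, k = 1..5")
```

For $n = 13$ and $k = 2$, for example, the sketch returns $x = 5$: five suspect coins go on one beam, four suspect coins plus the genuine coin on the other, and four suspect coins stay off the balance.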

We now return to our initial $n \le (3^k - 3)/2$ coins, of which one is
counterfeit. In the first weighing we put a group of $(3^{k-1} - 1)/2$ coins on
each beam; then, the number of coins not on the balance is†

$$n_1 = n - 2 \cdot \frac{3^{k-1} - 1}{2} \le \frac{3^k - 3}{2} - (3^{k-1} - 1) = \frac{3^{k-1} - 1}{2}.$$

If the beams are equal, then the suspect coins are the
$n_1 \le (3^{k-1} - 1)/2$ not on the balance. Since, in addition, we also have
$3^{k-1} - 1$ definitely genuine coins, by what has been proved above, from
$k - 1$ successive weighings we can find the counterfeit coin and ascertain
whether it is lighter or heavier than the genuine ones. If, however, either of
the beams is heavier, then the suspect coins are the $3^{k-1} - 1 < 3^{k-1}$ on
the balance and we also know that the counterfeit coin is lighter than the
others if it is one of the group of $a = (3^{k-1} - 1)/2$ coins but heavier if
it is one of the group of $b = (3^{k-1} - 1)/2$ $(= a)$ coins. By what has been
stated on p. 110, here too we can find the counterfeit coin from $k - 1$
successive weighings. This completes the proof of the assertion made earlier
regarding the number of weighings that are necessary.
We further note that for large $n$ the number $k$, defined by the inequalities

$$k - 1 < \frac{\log(2n + 3)}{\log 3} \le k,$$

can be quite accurately replaced by the ratio $\log 2n/\log 3$ (in the sense
that the ratio $k : (\log 2n/\log 3)$ rapidly tends to unity for increasing
$n$).

There are, of course, a great variety of different problems related to the 
determination of counterfeit coins by means of weighings on beam balances. So 
far, we have considered throughout that only one of the coins at our disposal is 
counterfeit (has a weight different from that of the rest of the coins); however, 
it may also be assumed, for example, that among the given coins, there are two 
or more that are counterfeit. Still more difficult are the problems in which the 
very number of counterfeit coins is also assumed to be unknown.‡ We can
even consider that the counterfeit coins have two or more different weights. An 
idea of the new problems arising in this case is given by the next problem due 
to H. Steinhaus, the Polish mathematician (see [63], p. 42). 


† In the case in which $n$ equals $(3^k - 3)/2$, the information
$I(\alpha_1, \beta)$ about $\beta$ contained in our experiment $\alpha_1$ (first
weighing) is exactly $\log 3$.

‡ For the case of two or more counterfeit coins see, for example, [54] (also
[56]). The general case is dealt with in [53] and [57], of which the former
contains a more detailed discussion of certain distinctive variants of the
counterfeit coin problem (with the indications of the possible applications of
those problems) and an extensive bibliography.
Problem 30. There are four objects of different weights and a beam balance
on which the weights of any pair of objects can be compared. It is required to 
indicate a method that enables us to determine the sequence of the weights of these 
objects by means of at most five weighings. Show that there is no way to guarantee 
the possibility of ascertaining the sequence of weights of the objects by means of 
at most four weighings. 

For 10 objects of mutually different weights there is a method for determining 
the sequence of the weights of the objects by means of at most 24 weighings (find 
this method). Can this number of weighings be decreased?

A complete solution of this problem (in which it is obvious that the number 
of objects can, in fact, be arbitrary) is not known so far; some particular results 
related to this can be found, for example, in [55] and [58]. There is also a series 
of other problems of similar type (we shall elaborate on this in the next section); 
as a rule, they are extremely tedious; however, information theory contributes 
at least a general approach to their investigation. 
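
As an illustration of that general approach applied to Problem 30, the counting argument used throughout this section gives a lower bound on the number of pairwise weighings: each weighing of two objects of different weights has two outcomes, so $k$ weighings distinguish at most $2^k$ orderings. The following is a minimal sketch (the function name `comparison_lower_bound` is an assumption of this example); it yields only a lower bound and says nothing about whether the bound can be attained.

```python
from math import factorial

def comparison_lower_bound(n):
    """Least k with 2**k >= n!, since each pairwise weighing has two outcomes."""
    orderings = factorial(n)
    k = 0
    while 2 ** k < orderings:
        k += 1
    return k

for n in (4, 10):
    print(f"{n} objects: {factorial(n)} orderings, at least {comparison_lower_bound(n)} weighings")
```

For four objects the bound is 5, which matches the statement of the problem; for ten objects it is 22, so this argument alone does not decide whether 24 weighings can be improved upon.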


3.3. Discussion 


In Sections 1 and 2 of this chapter, the concepts of entropy and information 
introduced in Chapter 2 were applied to analyze certain specific problems of the 
type of ‘mathematical recreations’. In what follows we shall see that reasoning
of the same kind is also found to be useful in the solutions of a series of
sufficiently serious engineering problems. It will be therefore appropriate to 
discuss here in depth the general idea of all examples considered; as a result, 
we arrive also at a more general formulation of the problems, which is highly 
important for the next chapter. 

All the examples of Sections 1 and 2 were constructed according to a single
scheme. In each of them, we were interested in a certain object from a finite
set M of similar objects. For example, in Problems 23 and 24, the set M
consisted of some towns and we had to determine the town in which the observer O
was placed; in Problem 25, M consisted of positive integers and in Problem 26 of
$\binom{100}{2} = 4950$ pairs of integers; in Problems 27-29, M consisted of
coins and the requirement was to find one of them, namely, the counterfeit coin;
finally, in Problem 30, M consisted of all possible ordered collections of
objects we had at our disposal (such that in the case of 4 objects, M contained
all $4! = 24$ possible orderings of these objects) and the problem posed was to
find out which of these arrangements corresponded to the sequence of weights of
the objects, starting from the heaviest and ending with the lightest of them.
Using the terminology to which we have become accustomed in the first two
chapters of this book, we can assert that we studied the experiment $\beta$
which can have $n$ different outcomes $B_1, B_2, \ldots, B_n$; also, we denote
by M the set of all these outcomes. For separating the object of our interest
(the outcome of $\beta$), we make use of auxiliary experiments $\alpha$, each of
which can have $m < n$ possible outcomes (the experiments $\alpha$ were either
questions which could have two different answers, ‘yes’ and ‘no’, or weighings
on beam balances which could have three different outcomes: E, R and L). The
outcomes of experiments $\alpha$ separate some subset of the set M of the
outcomes of $\beta$, which enabled us to reject a number of outcomes
$B_1, B_2, \ldots, B_n$ as ‘false’ or ‘not occurring’. We were required to
indicate the least number of auxiliary experiments $\alpha$ that were necessary
to find the correct answer to the question we were interested in (i.e., to
ascertain the outcome of experiment $\beta$) and to describe the precise manner
in which this answer could be found most rapidly.

The construction, consistent with the one described above, holds not only for
the recreative problems of Sections 1 and 2 but also for many vital problems.
Examples of the latter include the problem of efficient coding of messages,
which is the foremost concern of this book (see Chapter 4); the problem of
sorting out objects according to some criterion; the problem of searching for a
word in a dictionary or a requisite book in a large library; the problem of
designing an efficient control programme for some objects, say, for lathes in a
factory, and so on. Lately, such a wide range of possible applications has
evoked a great interest in the themes of Sections 1 and 2 and led to the
creation of an elaborate terminology. The system of experiments $\alpha$ that
leads to finding the object of our interest (the outcome of experiment $\beta$)
is called a questionnaire and the experiment $\alpha$ itself a question;
moreover, the questions may differ with respect to both the number of possible
answers† and, in a series of cases, the ‘cost’ that characterizes the
expenditure involved in the corresponding experiment $\alpha$ or the efforts
that have to be put in to obtain a reply (i.e., to find the outcome of
$\alpha$). The problem is to find such a procedure for ‘asking questions’ (i.e.,
such a sequence of experiments $\alpha$) as would lead to the desired answer
(the outcome of $\beta$) with the aid of the ‘shortest’ series of questions (in
terms of numbers or total ‘cost’). There exists an extensive literature devoted
to the theory of questionnaires, of which we may just mention [61] by K. Picard,
the French mathematician, and the Russian review paper [60].

† In principle, this does not exclude the situation in which different possible
‘questions’ $\alpha$ have different numbers of possible answers; thus, for
example, it is possible to conceive a variant of the counterfeit coin problem
such that in order to find this coin either questions can be asked of a person
who knows which is the counterfeit coin (such an experiment can have two
answers: ‘yes’ and ‘no’), or weighings of coins can be resorted to (this
experiment can have three answers: E, R and L).

It is clear that in all problems of the sort considered, it is desirable to
utilize most expediently the information about the outcome of experiment $\beta$
which is contained in the results of the auxiliary experiments $\alpha$.
However, it appears that the term ‘information’ is used here in the commonplace
‘everyday’ sense and not in that more specialized sense which was given to it in
Chapter 2. In fact, the quantity $I$ we introduced in Chapter 2 had a purely
statistical meaning; indeed, its definition was based on the concept of
probability. However, in our problems many repetitions of trials do not figure,
nor are probabilities involved anywhere; hence, the possibility of applying to
these problems the theory developed in Chapter 2 may seem odd at first glance.

The way round the difficulty indicated, which we actually used all the time,
consists of the following. Suppose that we solve one and the same problem many
times (i.e., many times seek the correct answer to one and the same question),
where the correct answers are found to be different in different cases and each
of the answers has a definite probability of being correct; the corresponding
probabilities $p(B_1), p(B_2), \ldots, p(B_n)$ are considered arbitrary, but
assigned beforehand. In such a case, we can speak of ‘experiment $\beta$ which
consists of finding the correct answer’, the term ‘experiment’ being used here
in exactly the same sense in which it was used in the preceding chapter.
Experiment $\beta$ corresponds to the probability table

  Outcomes of experiment    B_1       B_2       ...    B_n
  Probabilities             p(B_1)    p(B_2)    ...    p(B_n)

and its entropy is equal to
$-p(B_1)\log p(B_1) - p(B_2)\log p(B_2) - \ldots - p(B_n)\log p(B_n)$, which we
denote as usual by $H(\beta)$. Since our auxiliary experiments $\alpha$ are
always ‘directed straight’ to find the outcome of $\beta$, in the sense that the
knowledge of this outcome completely determines the outcome of $\alpha$ also,
the assignment of probabilities to the $n$ outcomes of experiment $\beta$
enables us to determine also the probabilities of the $m$ outcomes of any such
experiment $\alpha_1$. Hence with reference to $\alpha_1$ also the term
‘experiment’ can be used in the same sense as in Chapter 2. Furthermore, from
the fact that the outcome of $\beta$ completely determines the outcome of
$\alpha_1$, it follows that the conditional entropy $H_\beta(\alpha_1)$ is zero,
and the conditional entropy $H_{\alpha_1}(\beta)$ equals the difference
$H(\beta) - H(\alpha_1)$ of the entropies of experiments $\beta$ and $\alpha_1$
(see p. 66). But the conditional entropy $H_{\alpha_1}(\beta)$ is the mean value
of the entropies $H_{A_1}(\beta), \ldots, H_{A_m}(\beta)$ of experiment $\beta$,
corresponding to the distinct possible outcomes $A_1, \ldots, A_m$ of experiment
$\alpha_1$. Hence, for at least one of these $m$ outcomes $A_i$ the entropy
$H_{A_i}(\beta)$ is found to be not less than $H(\beta) - H(\alpha_1)$; thus,
cases are certainly possible for which, after determining the result of trial
$\alpha_1$, the remaining entropy (the amount of uncertainty) of experiment
$\beta$ is not less than the difference $H(\beta) - H(\alpha_1)$.
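
This relation between $H(\beta)$, $H(\alpha_1)$ and the residual entropies can be illustrated on the 12-coin problem. The following is a minimal sketch under stated assumptions: the coins are numbered 0 to 11, coins 0-3 are placed on the left beam and coins 4-7 on the right beam, and the names `beta`, `alpha_1`, `groups` and `residual` are chosen only for this example.

```python
from collections import defaultdict
from math import log2

def entropy(probs):
    """Entropy in bits of a finite probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# beta: (coin index, +1 heavier / -1 lighter), 24 equally probable outcomes.
beta = {(coin, sign): 1 / 24 for coin in range(12) for sign in (+1, -1)}

def alpha_1(outcome):
    """First weighing (assumed layout): coins 0-3 on the left, coins 4-7 on the right."""
    coin, sign = outcome
    if coin < 4:
        return "L" if sign > 0 else "R"
    if coin < 8:
        return "R" if sign > 0 else "L"
    return "E"

# The outcome of beta determines the outcome of alpha_1, so we can group by it.
groups = defaultdict(list)
for outcome, p in beta.items():
    groups[alpha_1(outcome)].append(p)

h_beta = entropy(beta.values())
h_alpha = entropy([sum(ps) for ps in groups.values()])
residual = {a: entropy([p / sum(ps) for p in ps]) for a, ps in groups.items()}

print("H(beta) - H(alpha_1) =", round(h_beta - h_alpha, 3))
print("residual entropies:", {a: round(h, 3) for a, h in residual.items()})
```

Here every outcome of the weighing leaves eight equally probable possibilities, so each residual entropy equals $\log 8 = 3$ bits, which coincides with $H(\beta) - H(\alpha_1) = \log 24 - \log 3$.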

It is clear how we can generalize this last reasoning. We arbitrarily choose
a sequence of auxiliary experiments (trials)
$\alpha_1, \alpha_2, \ldots, \alpha_k$, i.e., we consider a certain compound
experiment $A_k = \alpha_1\alpha_2\ldots\alpha_k$. We assume also that the
individual experiments $\alpha_1, \alpha_2, \ldots, \alpha_k$ are not
necessarily independent, i.e., that the results of a preceding trial can
influence the conditions for carrying out of succeeding trials; it is even
possible that for certain special outcomes of the first few experiments
$\alpha$, all succeeding experiments become redundant, i.e., can be understood
as experiments having a unique fully defined outcome (this means that the
compound experiment $A_k$ consists of not more than $k$ experiments $\alpha$ but
not necessarily of exactly $k$ such experiments). In the examples considered,
knowledge of the outcome of $\beta$ always determined the outcome of the
compound experiment $A_k$, so that by using the probabilities of the individual
outcomes of $\beta$ one can also find the probabilities of various outcomes of
experiment $A_k$; hence here, too, the application of the term ‘experiment’ to
$A_k$ should not cause any confusion.

We also note that if each experiment α_1, α_2, ..., α_k can have not more than
m outcomes, then the total number of distinct outcomes of A_k does not exceed
m^k. From the fact that the outcome of β determines the outcome of A_k it
follows that the average conditional entropy H_{A_k}(β) of β, given the
occurrence of the compound experiment A_k, is equal to the difference
H(β) − H(A_k) of the entropies of β and A_k; hence for at least one outcome of
A_k (i.e., for some specified outcome of the k trials α_1, α_2, ..., α_k) the
‘residual entropy’ of β is not less than H(β) − H(A_k).

We now suppose that the difference H(β) − H(A_k) is greater than zero. In
such a case, for at least one outcome of the compound experiment A_k, there
still remains some uncertainty in the outcome of β. In other words, when the
entire series of k experiments α is repeated many times and only those cases
are separated out for which all the experiments α had some results specified
beforehand, sometimes one and sometimes another answer to our basic question β
turns out to be correct. Hence it follows that in the cases in which the
compound experiment A_k has the indicated outcome, this outcome does not
enable us to determine uniquely which of the answers to the question
considered in the problem is correct; therefore, k experiments α do not
suffice here for this purpose.

This very reasoning was used for the solutions of Problems 23-29. In addition,
it was also taken into account that the impossibility of finding the outcome
of β from the outcomes of k experiments α can always be inferred when the
inequality H(β) − H(A_k) > 0 holds for at least one choice of the
probabilities p(B_1), p(B_2), ..., p(B_n) of the outcomes of β. It is usually
found sufficient to consider only the ‘most disadvantageous’ case for which
the entropy of experiment β assumes the maximum value, i.e., for which all
outcomes of this experiment are equiprobable:

$$p(B_1) = p(B_2) = \cdots = p(B_n) = \frac{1}{n};$$


this is precisely what we did in the foregoing when we said that “we shall
consider all the outcomes of β to be equiprobable, since no information is
available on these outcomes.” It is obvious that with such a choice of
probabilities for the outcomes of β, the equality H(β) = log n is true.
Regarding the compound experiment A_k, an exact calculation of its entropy is
often not simple in specific problems; however, in many cases it suffices to
confine ourselves to the simple estimate H(A_k) ≤ log m^k = k log m, which
stems from the fact that the number of different outcomes of A_k cannot exceed
m^k. In more complicated cases, we evaluate quite accurately the largest
‘residual entropy’ of experiment β corresponding to the most ‘unsuccessful’
outcomes of the first experiment α_1, and only after this do we use the fact
that the entropy of each of the succeeding experiments α_2, ..., α_k does not
exceed log m (see pp. 111-112 and 116-117). Let us also note that the estimate
H(A_k) ≤ k log m, together with the requirement log n = H(β) ≤ H(A_k), leads
directly to the important inequality

$$k \ge \frac{\log n}{\log m}. \tag{1}$$


This inequality can, of course, be deduced even without using concepts from
information theory; it means simply that when n different possibilities are
involved it is impossible to determine one of them uniquely with the aid of a
compound experiment having fewer than n distinct outcomes.† Our foregoing
estimates of the necessary number of experiments α frequently reduce to using
only the simple inequality (1).
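
Inequality (1) can be evaluated without any logarithms by counting outcomes. The following is a minimal sketch, not part of the original text; the function name `min_trials_lower_bound` is an illustrative assumption. It returns the smallest integer k with m^k ≥ n, which is the smallest k permitted by (1).

```python
def min_trials_lower_bound(n: int, m: int) -> int:
    """Return the smallest integer k with m**k >= n, i.e. k >= log n / log m."""
    if n < 1 or m < 2:
        raise ValueError("need n >= 1 possibilities and m >= 2 outcomes per trial")
    k, capacity = 0, 1  # capacity = m**k: the most outcomes k trials can distinguish
    while capacity < n:
        capacity *= m
        k += 1
    return k


# Yes/no questions (m = 2) about a number from 1 to 10.
print(min_trials_lower_bound(10, 2))  # 4
# Three-outcome weighings (m = 3) with 24 possibilities.
print(min_trials_lower_bound(24, 3))  # 3
```

The bound only says that fewer trials cannot suffice; whether k trials actually do suffice has to be shown by exhibiting a concrete chain of trials.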

†We emphasize that calculating the number of possibilities available here is
equivalent to using the simplest uncertainty concept in Hartley’s sense (see
p. 53).

Our basic conclusion on the infeasibility of determining uniquely the outcome
of β by the outcome of the compound experiment A_k when H(β) − H(A_k) > 0
can be justified somewhat differently also. If the outcome of the compound
experiment A_k determines completely the outcome of β, then H_{A_k}(β) = 0 and
hence, by virtue of the equality I(A_k, β) = H(β) − H_{A_k}(β), the
information I(A_k, β) about experiment β contained in experiment A_k must be
equal to the amount of uncertainty of β, i.e., I(A_k, β) = H(β). On the other
hand, if the outcome of experiment β also uniquely determines the outcome of
the compound experiment A_k, then we have at the same time I(β, A_k) = H(A_k).
Thus, if the compound experiment A_k (consisting of not more than k
experiments α) enables us in all cases to indicate uniquely the correct answer
to a question asked (i.e., to ascertain the outcome of the experiment β), then
the equality H(A_k) = H(β) must hold. For instance, in the conditions of
Problem 29, it is easy to see that H(α_1) = log 3 ≈ 1.58 bits (all outcomes of
the first weighing were equally probable); furthermore, for any outcome of the
first weighing, the second weighing (experiment α_2) was so chosen that its
three outcomes had probabilities 1/4, 3/8 and 3/8 and, consequently,
H_{α_1}(α_2) = −(1/4) log (1/4) − (3/8) log (3/8) − (3/8) log (3/8) ≈ 1.56 bits
(see pp. 114 and 115); finally, the third weighing (experiment α_3), for the
case in which α_2 had an outcome with probability 1/4, reduced to a comparison
of two coins known to have different weights on the beam balance, i.e., had
entropy log 2 = 1, but in the remaining 3/4 of all cases (for either of the
two outcomes of α_2 with probability 3/8) it could have three equally probable
outcomes, i.e., had entropy log 3. Hence, we have here
H_{α_1α_2}(α_3) = (1/4) log 2 + (3/4) log 3 ≈ 1.44 bits and since
H(β) = log 24 ≈ 4.58 bits, we have


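$$
\begin{aligned}
H(A_3) = H(\alpha_1\alpha_2\alpha_3) &= H(\alpha_1) + H_{\alpha_1}(\alpha_2) + H_{\alpha_1\alpha_2}(\alpha_3) \\
&= 1.58 + 1.56 + 1.44 = 4.58 \text{ bits} = H(\beta),
\end{aligned}
$$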


as it ought to be. If, however, the equality H(A_k) = H(β) is not satisfied and
we have the inequality H(A_k) < H(β), then this means that experiment A_k
certainly does not allow the correct answer to be indicated uniquely.
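
The arithmetic for Problem 29 quoted above can be checked numerically. The following is a minimal sketch, not part of the original text: the helper `entropy` is an illustrative name, and the second-weighing probabilities 1/4, 3/8, 3/8 are taken as the values consistent with the quoted 1.56 bits.

```python
from math import isclose, log2


def entropy(probs):
    """Entropy in bits of a finite list of outcome probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


# First weighing: three equally probable outcomes.
h1 = entropy([1 / 3, 1 / 3, 1 / 3])
# Second weighing: outcome probabilities 1/4, 3/8, 3/8 (assumed as stated above).
h2 = entropy([1 / 4, 3 / 8, 3 / 8])
# Third weighing: two equally likely outcomes in 1/4 of the cases,
# three equally likely outcomes in the remaining 3/4.
h3 = (1 / 4) * entropy([1 / 2, 1 / 2]) + (3 / 4) * entropy([1 / 3, 1 / 3, 1 / 3])

h_compound = h1 + h2 + h3   # H(A_3) by the addition rule for entropies
h_beta = log2(24)           # 24 equally probable outcomes of beta

print(round(h1, 2), round(h2, 2), round(h3, 2))  # 1.58 1.56 1.44
print(isclose(h_compound, h_beta))               # True
```

The three conditional entropies add up to log 24 exactly, not merely to two decimal places, which is what the equality H(A_3) = H(β) requires.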

It is also easy to see that the assumption that the outcome of β completely
determines the outcomes of the trials α is not necessary for the last
conclusion to be true. If this assumption does not hold, then the assignment
of probabilities to individual outcomes of β does not enable us to
predetermine uniquely the probabilities of all outcomes of the auxiliary
experiments α. Hence, while assuming that the experiments for the
determination of the outcome of β by the outcomes of experiments α are carried
out many times, it is further necessary here to assign probabilities to the
latter outcomes as well (of course, they should be such that their values do
not contradict the already assigned values of the probabilities of the
outcomes of β). In this case, as before, if the compound experiment
A_k = α_1α_2 ... α_k, consisting of not more than k trials α, completely
determines the outcome of β, then the information
I(A_k, β) = H(β) − H_{A_k}(β) equals the entropy H(β). On the other hand,
since we always have I(A_k, β) = H(A_k) − H_β(A_k) ≤ H(A_k), the inequality
H(β) ≤ H(A_k) must hold. Thus, as earlier, if

$$H(A_k) < H(\beta),$$


then the outcome of the compound experiment A_k = α_1α_2 ... α_k cannot in all
cases uniquely determine the outcome of β. This conclusion enables us to obtain
an estimate of the least number k of trials α that permits us to determine the
outcome of β. However, in the case under consideration here, the estimate so
obtained is usually found to be considerably less precise than in the case in
which the outcome of β uniquely defines the outcomes of all trials α. This is
related to the fact that in the former case the trials α are not directed
straight to finding the outcome of β and, consequently, the information
I(A_k, β) with respect to β contained in the k trials α_1, α_2, ..., α_k is
not equal to but less than the entropy H(A_k).

Let us, for example, assume that in the conditions of Problem 29 (see p. 116)
we are not required to find out whether the counterfeit coin is heavier or
lighter than the genuine ones (we are only required to indicate which coin is
counterfeit). We assume that any one of the n existing coins may, with a
definite probability, turn out to be counterfeit. In this case, we can
calculate the probabilities of all outcomes of experiment β. If, in addition,
we assume that the counterfeit coin has a definite probability of being
heavier or lighter than the rest of the coins, then we can determine also the
probabilities of all outcomes of any trial α, which permits us to speak, with
complete legitimacy, of the entropy of experiments α and β and the information
contained in either of them with respect to the other. In particular, if we
consider that all outcomes of experiment β are equally probable (i.e., that
each of the n coins has the same probability of being counterfeit), then the
entropy H(β) of experiment β equals log n. On the other hand, the entropy of
each of the experiments α does not exceed log 3 (since an experiment of this
sort, as before, can have three distinct outcomes: E, L and R) and the entropy
of the compound experiment A_k = α_1α_2 ... α_k does not exceed k log 3. This
implies that the least number k of weighings required to find the counterfeit
coin must satisfy the inequality

$$k \ge \frac{\log n}{\log 3}. \tag{2}$$


This estimate gives a smaller number k than the analogous estimate of the least
number of weighings necessary to determine the counterfeit coin and find out
whether it is lighter or heavier than the rest; the inequality then has the
form

$$k \ge \frac{\log 2n}{\log 3} \tag{3}$$


(because here experiment β has 2n distinct outcomes, since each coin may turn
out to be either lighter or heavier than the rest). However, estimate (3) is
rather exact: thus, for k = 3, estimate (3) gives n ≤ 13 and, as we know, the
maximum number of coins from which (in three weighings) we can separate a
counterfeit coin and determine whether it is lighter or heavier than the rest
is 12 (see Problem 28). In contrast to this, estimate (2) is highly
inaccurate: it yields only the inequality n ≤ 27 for k = 3, whereas we can
actually verify that the maximum number of coins from which the counterfeit
coin can be separated in three weighings without ascertaining whether it is
lighter or heavier than the rest is only 13. This is explained by the fact
that here the experiments α (i.e., the weighings of the coins) are not
directed straight to the determination of the outcome of β (they contain
‘extraneous’ information, namely, information about the weight of the
counterfeit coin). Hence the contribution of each such experiment to the
information accumulated about the outcome of β is significantly less than
log 3 and, consequently, the number of experiments α has to be considerably
larger than log n/log 3.
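
The two numerical bounds quoted for k = 3 follow from rewriting estimates (2) and (3) as n ≤ 3^k and 2n ≤ 3^k. The following is a minimal sketch, not part of the original text; the function name `largest_n_allowed` and its parameter `outcomes_per_coin` are illustrative assumptions.

```python
def largest_n_allowed(k: int, outcomes_per_coin: int) -> int:
    """Largest n with outcomes_per_coin * n <= 3**k (k three-outcome weighings)."""
    return 3**k // outcomes_per_coin


# Estimate (2): only the counterfeit coin is sought, so beta has n outcomes.
print(largest_n_allowed(3, 1))  # 27
# Estimate (3): coin and 'heavier or lighter' are sought, so beta has 2n outcomes.
print(largest_n_allowed(3, 2))  # 13
```

These are only the values the estimates permit; according to the text the true maxima are 13 and 12 respectively, so (3) is off by one coin while (2) is off by fourteen.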

Let us now return to the question of how it can be established that the
outcome of the experiment β we are interested in can indeed be uniquely
determined by means of not more than k auxiliary experiments α. We have spoken
so far only of proofs of the infeasibility of finding the outcome of β with
the aid of a sufficiently small number of trials α. A corresponding ‘proof of
feasibility’ involves indicating explicitly the most expedient chain
α_1, α_2, ..., α_k of auxiliary experiments, or in other words, indicating the
appropriate compound experiment A_k. Of course, the ‘solution’ obtained in
this case does not involve the entropy and information concepts. These
concepts nevertheless play an important heuristic role, since they help to
find the appropriate chain of trials most rapidly. In fact, the objective of
our trials is to ascertain the outcome of experiment β, i.e., to obtain
complete information about this experiment. Hence, it is natural to select
these trials in such a manner that they contain the maximum possible
information about the outcome of β. A rigorous method of solving the problem
is to enumerate all the possible compound experiments A_k = α_1α_2 ... α_k,
evaluate the information I(A_k, β) for each of them, and select those A_k
which satisfy the equality I(A_k, β) = H(β). In the case in which the outcome
of β uniquely determines the outcomes of all trials α, the evaluation of
information is considerably facilitated by the fact that here we should have
I(A_k, β) = H(A_k). However, since it is inconvenient to operate directly with
the compound experiments A_k, in practice it is usual to start by determining
the auxiliary experiment α_1 (the first trial) that contains the greatest
amount of information I(α_1, β) about the outcome of β, then to select the
second trial α_2 (depending in general on the outcome of α_1) such that the
information I(α_1α_2, β) is the maximum possible, and so on. This is exactly
what we did in solving Problems 23-29.

†We give one instructive example to illustrate the complications that may
arise in the realization of this programme for those cases in which
H_β(α) ≠ 0 and the trials α are not directed wholly to the determination of
the outcome of experiment β. Suppose we must determine by means of weighings
on a beam balance whether a single counterfeit coin among four given coins is
lighter or heavier than the rest (however, it is not required to find the
counterfeit coin). It is obvious that here every weighing α_1 contains zero
information with respect to the experiment β we are interested in (since for
any outcome of experiment α_1 the probabilities that the counterfeit coin is
lighter and that it is heavier than the genuine coins are in no way altered),
i.e., any choice of α_1 leads to one and the same result, which is
discomforting at first sight. However, the obligatory equality I(α_1, β) = 0
does not at all mean that the auxiliary experiments α are useless; experiment
α_1 supplies directly no information about β, but it enhances the suitability
of the subsequent trials for this purpose. In fact, it is easy to see that
after placing one or two coins on each pan of the balance (i.e., choosing
experiment α_1 quite arbitrarily), we immediately arrive at the position in
which by means of one more weighing (experiment α_2) we can uniquely determine
the outcome of β.

We have assumed, throughout Sections 1 and 2, that all outcomes of experiment
β are equally probable. This assumption means that all outcomes of β are
considered to be equivalent. This is completely natural, since we require that
the outcome of β be determinable without a large number of trials, no matter
what this outcome may be. Clearly, a route for determining the outcome of β
that satisfies this condition leads in general to a compound experiment A_k
consisting in all cases (i.e., for every outcome of β) of roughly one and the
same number of individual trials α. Let us recall, for example, Problem 25 of
Section 3.1, which required us to ascertain, using a minimum number of
questions, which of the numbers from 1 to 10 was thought of by a certain
person. In the solution of this problem, we proposed to clarify in the first
place whether the unknown number x exceeded 5 (trial α_1); next, depending on
the outcome of α_1, we recommended determining whether or not the number x was
greater than 7 or 3 (trial α_2); further, taking account of the outcome of
α_2, it was possible to inquire whether or not x was greater than 8 or 6, or 4
or 1 (trial α_3); finally, if the three trials α_1, α_2 and α_3 failed to
determine the value of x, then we inquired further whether or not x was
greater than 9 or 2 (trial α_4). In all cases, it was necessary to make use of
not more than four questions to determine the number x. Moreover, if x
happened to equal one of the numbers 2, 3, 9 or 10, then the number of
questions needed was exactly four and in the remaining six cases it was three.
Clearly, had we asked at the very start whether or not the number x was equal
to, say, 10, then we would have had a definite chance of doing the trick with
exactly one question; however, in most cases we would then have to ask more
questions than in the series of questions described above, which makes such a
method of determining the outcome of β less profitable.

We now note that had we started with the question of whether or not the
unknown number x exceeded 8, then we would have had a chance of finding x by
asking two questions in all (if this number x were 9 or 10), and at the same
time we would not have needed in any case to ask more than four questions
(because if after the first question we found that the number x did not exceed
2^3 = 8, then we could have determined it by means of three more questions;
see the solution of Problem 25). Thus, at first glance such a method of
finding the unknown number x seems more profitable than that proposed in
Section 3.1. However, this is a rather hasty conclusion. In fact, if we do not
take the length of the longest chain of trials as the sole criterion, but
assess a method for finding x by taking into account also the fact that in
some cases it leads faster to the goal, then for the method developed above,
too, we must take into account that in many cases it allows us to find x with
the aid of three and not four questions.

In order to compare the ‘merits’ of both methods used for the solution of
Problem 25 in the light of the foregoing new approach to it, we assume that the
trials to find the unknown number x are repeated many times, and consider as
before that the probability of all ten numbers being thought of is the same. In
the first method for the solution of the problem, we have to ask three
questions altogether in roughly 6/10 = 3/5 of all the cases and four questions
in 4/10 = 2/5 of the cases (when x equals 2, 3, 9 or 10). Thus, the mean value
of the number of questions asked here is

$$\frac{3}{5} \times 3 + \frac{2}{5} \times 4 = \frac{17}{5} = 3.4.$$


The second method for the solution of the problem assures the determination of
x with the aid of two questions in 2/10 = 1/5 of the total number of all trials
(when x equals 9 or 10), whereas in the remaining 8/10 = 4/5 of the cases, four
questions must be asked. Hence, the mean value of the number of questions
asked here is given by

$$\frac{1}{5} \times 2 + \frac{4}{5} \times 4 = \frac{18}{5} = 3.6.$$
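
Both mean values can be reproduced by simulating the two questioning schemes for every number from 1 to 10. The following is a minimal sketch, not part of the original text: each scheme is encoded as a table (an illustrative assumption) mapping the current range of candidates to the threshold t of the next question "is x greater than t?". The thresholds of the first table are those described above; the second table starts with 8 and then halves the remaining range.

```python
from fractions import Fraction

# Current candidate range (lo, hi) -> threshold t of the question "is x > t?".
FIRST_METHOD = {
    (1, 10): 5, (6, 10): 7, (1, 5): 3, (8, 10): 8, (6, 7): 6,
    (4, 5): 4, (1, 3): 1, (9, 10): 9, (2, 3): 2,
}
SECOND_METHOD = {
    (1, 10): 8, (9, 10): 9, (1, 8): 4, (1, 4): 2, (5, 8): 6,
    (1, 2): 1, (3, 4): 3, (5, 6): 5, (7, 8): 7,
}


def questions_needed(x, thresholds, lo=1, hi=10):
    """Count the questions asked until x is the only remaining candidate."""
    count = 0
    while lo < hi:
        t = thresholds[(lo, hi)]
        count += 1
        if x > t:
            lo = t + 1
        else:
            hi = t
    return count


def mean_questions(thresholds):
    """Mean number of questions when all ten numbers are equally probable."""
    total = sum(questions_needed(x, thresholds) for x in range(1, 11))
    return Fraction(total, 10)


print(mean_questions(FIRST_METHOD))   # 17/5
print(mean_questions(SECOND_METHOD))  # 18/5
```

Exact fractions are used so that the results can be compared directly with 17/5 and 18/5.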


Thus, on the average the second method of finding x is slightly less advantageous 
than the first. This situation has a general character, which can be verbalized 
as follows: whatever the number n, there does not exist a method for the solution 
of Problem 25 which, on the average, would be more advantageous than the one 
given on pp. 105-106. 

The preceding conclusion lends a new insight into the problems considered in
Sections 3.1 and 3.2. It also makes more transparent the idea underlying the
use of the concepts of entropy and information for the solution of such
problems. It is clear that the application of these concepts, having an
essentially statistical character, is completely relevant only to those cases
in which the problem to be solved itself has a statistical character, i.e., it
is related to the many repetitions of one and the same trial. The whole point
is that we can also understand the foregoing Problems 23-29 in exactly this
way if we are interested not in the exact number of trials α that are required
to determine a single outcome of experiment β but rather in the mean value of
this number when the indicated experiment is repeated many times. If, in this
case, it is further stipulated that all outcomes of β are considered to be
equally probable, then for a choice of the trials α_1, α_2, ..., α_k such that
the mean value of their number is the least, the number of trials performed is
nearly the same for all outcomes of β. Hence the largest value of the number
of trials involved here is also, in general, the least possible.

Let us now try to do away with the condition wherein the outcomes of β are
considered to be equally probable. By way of an example, we recall Problem 25
but now make its structure slightly more complex. Suppose that someone thinks
of a definite number x that can take one of n values. It is required to find
this number by asking some ‘yes-or-no’ questions. It is further assumed that
we have, beforehand, some information about the number x because of which we
must consider the n possible values of this number to be not equally probable,
i.e., some of them are more likely to turn out to be the number thought of than
the others.† How should the questions be asked in this case?

It is clear that, if none of the n values of x is completely excluded by the in- 
formation available to us (otherwise we should have spoken not of n, but of a 
smaller number of possible values of x), then the least number of questions, 
which guarantee in all cases the determination of the number x, is defined as 
before by the inequalities (1) of Section 3.1 (p. 106), and it is necessary here to 
ask questions exactly in the same way as stated previously. Indeed, if there 
were a sequence comprising fewer questions, that would enable us in all cases 
(i.e., independently of the answers to these questions) to determine uniquely the 
number x, then this would contradict the result of Problem 25. However, this 
still does not imply that it is always expedient to act in exactly the same way as 
in the case in which all values of x are considered to be equally probable; after 
what has been stated above this ought to be perfectly clear. Thus, for instance,
if there is a very large probability that the number thought of has some
definite value x_0 (say, if this probability is 0.99 or still higher), then it
is obviously reasonable to ask in the first place whether or not x is equal to
this number x_0, in spite of the fact that in the case of a negative answer we
waste one question without deriving much profit (the set of possible values of
x is simply decreased by one). Moreover, in the general case it is profitable
to partition the set of possible values of x each time into two parts such
that the probabilities of the number thought of belonging to either of these
parts are as nearly equal as possible. This partition ensures that the entropy
of experiment α (when α consists of asking whether or not x belongs to one of
these parts) will be the largest possible and, consequently, also ensures that
the information contained in α with respect to the experiment β we are
interested in will be the maximum
possible. It is true that here we are not yet able to guarantee the minimality 
of the number of questions that we may require in the most unfavourable case 
but, on the other hand, here the mean value of the total number of questions is, 
in general, less (or, in any case, not greater) than that in any other formulation 
of questions. 
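
For a small number of values, the first question recommended above can be found by brute force. The following is a minimal sketch, not part of the original text: the function name `most_balanced_question` is an illustrative assumption, and the exhaustive search over subsets is only practical for small n. It picks the subset of values whose total probability is nearest to 1/2; the question is then "does x belong to this subset?".

```python
from fractions import Fraction
from itertools import combinations


def most_balanced_question(probs):
    """Return (subset of indices, its probability) with probability nearest 1/2."""
    if len(probs) < 2:
        raise ValueError("at least two possible values are needed")
    half = Fraction(1, 2)
    indices = range(len(probs))
    # All non-empty proper subsets, smallest subsets first.
    candidates = (
        subset
        for size in range(1, len(probs))
        for subset in combinations(indices, size)
    )
    best = min(candidates, key=lambda s: abs(sum(probs[i] for i in s) - half))
    return best, sum(probs[i] for i in best)


# One value has probability 0.99: the best first question singles it out.
skewed = [Fraction(99, 100)] + [Fraction(1, 300)] * 3
print(most_balanced_question(skewed))

# Four equally probable values: the best first question splits them two and two.
uniform = [Fraction(1, 4)] * 4
print(most_balanced_question(uniform))
```

A subset and its complement define the same question, so ties between them are broken arbitrarily (here in favour of the subset found first).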

†Specifically, we may suppose that the number thought of has been written and
the person who is to guess has glanced at what was written but is not sure of
what he has seen. However, the rigorous sense of this condition is obviously
connected with the assumption that, in the process of repeating the procedure
of guessing many times, some numbers are found to be guessed more often than
the others.

In place of a rigorous proof of the preceding statement we shall confine
ourselves here to the verification of it for a simple particular example (see
the text in small print at the end of this section). For the most general
case, it is comparatively easy to establish only that the mean value l of the
number of questions required to determine x is always not less than H(β)
(where H(β) is as usual the entropy of our experiment β).† This result is an
extension of the inequality k ≥ log n, which is related to the case in which
all possible values of x are equally probable; it can be justified by
reasoning that is closely analogous to that which led us to the stated
inequality. In fact, the information supplied by an answer to a question
obviously cannot exceed one bit in any case. Hence, by asking k questions, we
obtain information not exceeding k bits. If we now determine many times the
number thought of (say, 10,000 times) by asking questions according to some
method chosen by us, and if the probabilities that the unknown number
coincides with any of the n numbers have assigned values, then the mean
information given to us in one determination of the number x equals H(β) and
the total information obtained after 10,000 repetitions of guessing is close
to 10,000 H(β). Of course, the number of required questions may vary here
substantially from one determination of x to the other depending on precisely
what number x is thought of (it suffices to recall the case in which there
exists a definite number x_0 such that the probability that it will be guessed
is very large). However, by the very definition of the mean value l, the total
number of questions asked in all the 10,000 experiments for finding x is close
to 10,000 l (this means that, on the average, one determination of x involves
exactly l questions). Since each question supplies not more than one bit, we
may infer that the inequality

$$10{,}000\, H(\beta) \le 10{,}000\, l,$$

i.e.,

$$l \ge H(\beta) \tag{4'}$$


must be satisfied. This is what was to be proved. Since inequality (4′) is
vitally important in information theory (see Chapter 4, Section 2 in this
regard), we shall give later a totally different and highly elegant proof of
it, which, though more formal, is simpler in concept (see the concluding
portion of this section).
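
Inequality (4′) is easy to check for any concrete questioning scheme once the number of questions needed for each value of x is known. The following is a minimal sketch, not part of the original text: the function names are illustrative, the first data set is the first method for Problem 25, and the second (probabilities 1/2, 1/4, 1/8, 1/8 with the values asked about one at a time) is a hypothetical example chosen so that equality is attained.

```python
from math import log2


def entropy(probs):
    """Entropy H(beta) in bits of a list of outcome probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


def mean_questions(probs, counts):
    """Mean number l of questions, given the count needed for each value."""
    return sum(p * c for p, c in zip(probs, counts))


# Problem 25, first method: values 1..10 need 3 or 4 questions each.
uniform = [0.1] * 10
counts_first_method = [3, 4, 4, 3, 3, 3, 3, 3, 4, 4]

# Hypothetical skewed case: ask about the most likely remaining value in turn.
skewed = [0.5, 0.25, 0.125, 0.125]
counts_sequential = [1, 2, 3, 3]

for probs, counts in [(uniform, counts_first_method), (skewed, counts_sequential)]:
    print(round(mean_questions(probs, counts), 3), ">=", round(entropy(probs), 3))
```

In the first case l is about 3.4 against an entropy of about 3.32 bits; in the second case both quantities equal 1.75, which shows that the bound cannot be improved in general.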

†For the case in which n is quite large, and the probability of each individual
value of x is small, we can also show that this mean value is very close to
H(β) (see Chapter 4).

All that has been stated above with respect to Problem 25 can easily be carried
over also to Problem 27 (pp. 108-109). Here we must generalize the conditions
of the problem slightly by considering that different coins have different
probabilities of being counterfeit (this can be understood, for example, in
the sense that the outward appearance of some coins creates suspicion to a
varying extent). In this case, it is most expedient that in each weighing the
suspected coins be divided into three groups such that the probabilities of
the counterfeit coin being found in the two numerically equal groups of coins
placed on the right and left pans of the balance and in the third group not
placed on the balance always remain as close as possible to each other. In
this setting the total number of weighings needed for finding the counterfeit
coin may, in an unsuccessful case, be found to be even greater than that given
by inequality (2) of Section 3.2 (p. 109); however, the mean value of the
required number of weighings in this case is the least. We can also show that
this mean value l is always not less than H(β)/log 3, where H(β) is the
entropy of the experiment that consists of finding the counterfeit coin, i.e.,

$$l \ge \frac{H(\beta)}{\log 3} \tag{4''}$$

(see, in particular, the concluding part of this section). Moreover, when the
number of coins is large and the probabilities of any one of them being
counterfeit are small, the mean value l is always very close to H(β)/log 3.


We now give a simple example to illustrate the fact that for finding a number x
thought of (not exceeding some n) it is of greatest advantage to partition the
set of n possible values of x each time into two parts such that the
probabilities of x belonging to either of the parts are as close as possible
to each other.

Suppose that the number n of possible values of x equals 4; in this case the
number k defined by inequalities (1) (p. 106) equals 2. Assume now that we are
justified in assuming that one value x_0 of x is more probable than the other
three values x_1, x_2 and x_3. Let p be the probability that x equals x_0 and
q be the probability that x equals x_i (where i is any of the numbers 1, 2, 3;
p > q, p + 3q = 1). For the first question, we can ask whether x coincides
with either of the numbers x_0 or x_1; we could also begin by asking whether x
and x_0 are equal. The experiments that consist of asking these two questions
we denote by α^(1) and α^(2). Since the outcomes of α^(1) have the
probabilities p + q and 2q, we have
H(α^(1)) = −(p + q) log (p + q) − 2q log (2q). The two outcomes of experiment
α^(2) have, however, probabilities p and 3q, so that
H(α^(2)) = −p log p − 3q log (3q). If p > 1/2, then obviously the outcomes of
experiment α^(2) have more nearly equal probabilities than the outcomes of
experiment α^(1); if, however, 1/2 > p > q, then we have to compare the
differences (p + q) − 2q = p − q and 3q − p between the probabilities of the
two outcomes of experiments α^(1) and α^(2). Since p − q > 3q − p if p > 2q,
i.e., if p > 2/5 (because q = (1 − p)/3 and p > (2/3)(1 − p) when p > 2/5), we
infer that when p > 2/5 we should start with experiment α^(2) and when p < 2/5
with experiment α^(1); when p = 2/5 it is apparently immaterial which of these
two experiments we start with.


If our first question is ‘whether x is equal to either of the numbers x_0 and
x_1’, then we partition the set of possible values of x into two numerically
equal parts; in this case any answer to the first question enables us to find
x with the aid of just two questions. If, however, our starting question is
‘whether x is equal to x_0’, then we have a definite chance to find x by a
single question; the probability of this being precisely so is equal to the
probability that x coincides with x_0, i.e., equals p. However, if x is not
equal to x_0, then we may not be able to guarantee the possibility of finding
x by the succeeding question; the question ‘whether x equals the number x_1’
may be followed by a positive answer (the probability of this being q) but,
equally, it may be followed by a negative answer (the probability of this
equals the probability that x coincides with x_2 or x_3, i.e., equals 2q) and,
in the latter case, one more question (the third question) is required. Thus,
for the case in which we begin with experiment α^(2), we have the
probabilities p, q and 2q of finding x by one, two, and three questions,
respectively. Hence we see that here the mean value of the number of questions
is given by

$$p \times 1 + q \times 2 + 2q \times 3 = p + 8q.$$

It is easy to verify that p + 8q < 2 if p > 2/5 (because p + 8q = (8 − 5p)/3,
since q = (1 − p)/3). We are thus convinced that it is indeed appropriate to
begin with experiment α^(2) in the case in which p > 2/5.
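
The threshold p = 2/5 can be confirmed with exact arithmetic. The following is a minimal sketch, not part of the original text; the function names are illustrative assumptions. It computes the mean number of questions when we start by asking about x_0 alone, and the entropies of the two candidate first questions.

```python
from fractions import Fraction
from math import log2


def mean_questions_single_value_first(p):
    """Mean number of questions when the first question is 'is x equal to x_0?'."""
    q = (1 - p) / 3
    return p * 1 + q * 2 + 2 * q * 3   # equals p + 8q = (8 - 5p)/3


def first_question_entropies(p):
    """Entropies (bits) of the first questions alpha^(1) and alpha^(2)."""
    q = (1 - p) / 3

    def binary_entropy(a, b):
        a, b = float(a), float(b)
        return -(a * log2(a) + b * log2(b))

    return binary_entropy(p + q, 2 * q), binary_entropy(p, 3 * q)


# Splitting two and two always needs exactly 2 questions, so compare with 2.
for p in (Fraction(3, 10), Fraction(2, 5), Fraction(1, 2)):
    print(p, mean_questions_single_value_first(p), first_question_entropies(p))
```

For p = 3/10 the mean exceeds 2 and α^(1) has the larger entropy; for p = 1/2 the mean is 11/6 < 2 and α^(2) has the larger entropy; at p = 2/5 the mean is exactly 2, in agreement with the text.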


Before we conclude this section, we give one more rigorous proof of inequalities (4′) and
(4″) which is not based on any results of Chapter 2 except for the definition of the entropy of
an experiment. Instead, we shall make use of the following fact. Suppose that $p_1, p_2, \ldots, p_n$
are any $n$ positive numbers whose sum is 1 and $q_1, q_2, \ldots, q_n$ are any other positive numbers
whose sum does not exceed 1. Then we always have the inequality

$$-p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n \le -p_1 \log q_1 - p_2 \log q_2 - \cdots - p_n \log q_n. \qquad (*)$$

We defer a complete proof of inequality (*) to Appendix I; at present we note only that, for
$n = 2$, $p_1 = p_2 = \frac{1}{2}$, $q_1 + q_2 = 1$, this inequality assumes the form

$$-\frac{1}{2} \log q_1 - \frac{1}{2} \log q_2 \ge 1,$$

or differently,

$$\frac{1}{2} \log q_1 + \frac{1}{2} \log q_2 \le -1 = \log \frac{1}{2}, \quad \text{i.e.,} \quad \sqrt{q_1 q_2} \le \frac{1}{2} = \frac{q_1 + q_2}{2}.$$

Thus, if $p_1 = p_2 = \frac{1}{2}$ and $q_1 + q_2 = 1$, then inequality (*) reduces to the well-known inequality
between the arithmetic mean and the geometric mean of two numbers.
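Inequality (*) can also be probed numerically before turning to its proof. The sketch below is only a spot check on randomly chosen numbers, not a proof; logarithms are taken to base 2, and the helper names are our own.

```python
import math
import random

def lhs(p):
    """Left-hand side of (*): -sum of p_i log p_i."""
    return -sum(pi * math.log2(pi) for pi in p)

def rhs(p, q):
    """Right-hand side of (*): -sum of p_i log q_i."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def random_positive(n, total, rng):
    """n positive numbers whose sum is `total`."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(weights)
    return [total * w / s for w in weights]

rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(2, 6)
    p = random_positive(n, 1.0, rng)                    # sum is 1
    q = random_positive(n, rng.uniform(0.5, 1.0), rng)  # sum does not exceed 1
    assert lhs(p) <= rhs(p, q) + 1e-12                  # small tolerance for rounding
```

When the numbers q coincide with the numbers p, the two sides are equal.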


Now we recall the experiment $\beta$, which has $n$ outcomes $B_1, B_2, \ldots, B_n$ giving the probability
table

    Outcomes of experiment    $B_1$    $B_2$    ...    $B_n$
    Probabilities             $p_1$    $p_2$    ...    $p_n$


Suppose that, to find which of the outcomes of $\beta$ actually occurs, a sequence of trials $\alpha$ (the
auxiliary experiments), each of which can have $m$ different outcomes, is carried out. We denote,
as before, the largest number of trials that may be required for determining the outcome of $\beta$
by $k$. Suppose further that $n_1, n_2, \ldots, n_k$ are the numbers of those outcomes of $\beta$ that can be
ascertained by means of one ($\alpha_1$), two ($\alpha_1, \alpha_2$), ..., and $k$ ($\alpha_1, \alpha_2, \ldots, \alpha_k$) trials. It is obvious
that $n_1 + n_2 + \cdots + n_k = n$.

We note that the number $n_1$ of outcomes of $\beta$ that can be determined with the aid of one
trial $\alpha_1$ obviously does not exceed the number $m$ of outcomes of $\alpha_1$:

$$n_1 \le m.$$

Moreover, $n_1 = m$ only in the case (which is evidently of trivial interest) for which $n = m$ and
to each outcome of $\alpha_1$ there corresponds a unique outcome of $\beta$ (for example, when in the
conditions of Problem 25 the number of possible values of the number thought of is 2). If,
however, there exist outcomes of $\alpha_1$ that do not uniquely determine the outcome of $\beta$, i.e., if
there are cases in which it is found necessary to carry out the succeeding trial $\alpha_2$, then surely
$n_1 < m$. In this case, the number of outcomes of $\alpha_1$ that do not uniquely determine the outcome
of $\beta$ is $m - n_1$. Since the number of outcomes of $\alpha_2$ equals $m$, the number $n_2$ of those
outcomes of $\beta$ that can be determined with the aid of two trials $\alpha_1$ and $\alpha_2$ certainly satisfies the
inequality

$$n_2 \le (m - n_1)m = m^2 - n_1 m.$$


Quite similarly, if in certain cases we must also carry out the third auxiliary experiment $\alpha_3$,
then $n_2 < (m - n_1)m$. Moreover, here the experiment $\alpha_3$ is necessary for not more than
$(m - n_1)m - n_2$ outcomes of $\alpha_2$. Furthermore, since experiment $\alpha_3$ itself has $m$ different outcomes
in all, it is obvious that

$$n_3 \le [(m - n_1)m - n_2]m = m^3 - n_1 m^2 - n_2 m.$$

In exactly the same way we can show that

$$n_4 \le [(m^3 - n_1 m^2 - n_2 m) - n_3]m = m^4 - n_1 m^3 - n_2 m^2 - n_3 m,$$

and so on. Finally, for the number $n_k$ of outcomes of $\beta$ whose determination involves exactly
$k$ trials, it is easy to obtain by induction that

$$n_k \le [(m^{k-1} - n_1 m^{k-2} - n_2 m^{k-3} - \cdots - n_{k-2} m) - n_{k-1}]m$$

$$= m^k - n_1 m^{k-1} - n_2 m^{k-2} - \cdots - n_{k-2} m^2 - n_{k-1} m.$$


Let us transfer all the terms on the right-hand side here to the left-hand side, except the first
term $m^k$, and divide both sides of the resultant inequality by $m^k$; then we obtain

$$\frac{n_1}{m} + \frac{n_2}{m^2} + \cdots + \frac{n_{k-1}}{m^{k-1}} + \frac{n_k}{m^k} \le 1.$$

We denote by $l_i$ (where $i = 1, 2, \ldots, n$) the number of trials $\alpha$ that have to be carried out
to determine the outcome of $\beta$ in the case in which this outcome is found to be $B_i$. In such a
case, out of the $n$ numbers $l_i$, there are $n_1, n_2, \ldots, n_k$ equal to $1, 2, \ldots, k$, respectively. Hence
the preceding inequality can also be written in the form

$$\frac{1}{m^{l_1}} + \frac{1}{m^{l_2}} + \cdots + \frac{1}{m^{l_n}} \le 1.$$
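The last inequality is easy to evaluate for any proposed list of trial counts. A minimal sketch, using exact fractions; the function name `trial_count_sum` and the sample lists are illustrative assumptions.

```python
from fractions import Fraction

def trial_count_sum(lengths, m):
    """Sum of 1 / m**l over the numbers of trials l needed for each outcome."""
    return sum(Fraction(1, m ** l) for l in lengths)

# Five outcomes found by yes/no questions (m = 2): three need 2 questions, two need 3.
print(trial_count_sum([2, 2, 2, 3, 3], 2))   # 1

# Trial counts 1, 1, 2 with m = 2 give a sum above 1,
# so no scheme of yes/no questions can have these counts.
print(trial_count_sum([1, 1, 2], 2))         # 5/4
```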


We now recall that for inequality (*) to be valid the only requirement is that the sum of all
numbers $p_i$ be 1, and that the sum of all numbers $q_i$ ($i = 1, 2, \ldots, n$) should not exceed 1.
Hence we can put into this inequality, in particular, $p_i$ equal to the probability of the $i$th
outcome $B_i$ of experiment $\beta$ and $q_i$ equal to $1/m^{l_i}$, so that

$$-p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n \le -p_1 \log \frac{1}{m^{l_1}} - p_2 \log \frac{1}{m^{l_2}} - \cdots - p_n \log \frac{1}{m^{l_n}}.$$

The left-hand side of the preceding inequality is obviously the entropy $H(\beta)$ of $\beta$. On the
right-hand side we now replace $-\log (1/m^{l_i})$ (where $i = 1, 2, \ldots, n$) by $l_i \log m$ to obtain

$$H(\beta) \le [p_1 l_1 + p_2 l_2 + \cdots + p_n l_n] \log m.$$

But by the very definition of the mean value (see p. 6) the sum $p_1 l_1 + p_2 l_2 + \cdots + p_n l_n$ is
exactly equal to the mean value $l$ of the required number of trials $\alpha$. We thus obtain the basic
inequality

$$l \ge \frac{H(\beta)}{\log m}. \qquad (4)$$


This is the result we wished to prove. When $m = 2$ (for example, for the case in which
experiment $\alpha$ is a ‘yes’ or ‘no’ question), it carries over to inequality (4′) (because $\log 2 = 1$),
and when $m = 3$ (for example, for the case in which $\alpha$ is a weighing on a beam balance) it
carries over to inequality (4″).
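Inequality (4) can be illustrated on a small example. The sketch below assumes base-2 logarithms; the probability table and trial counts are made up for illustration and chosen so that the bound is attained.

```python
import math

def entropy(probs):
    """Entropy of an experiment in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mean_trials(probs, lengths):
    """Mean number of trials: p_1 l_1 + ... + p_n l_n."""
    return sum(p * l for p, l in zip(probs, lengths))

def lower_bound(probs, m):
    """Right-hand side of inequality (4): H(beta) / log m."""
    return entropy(probs) / math.log2(m)

# Four outcomes found by yes/no questions (m = 2) in 1, 2, 3 and 3 questions.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
print(mean_trials(probs, lengths), lower_bound(probs, 2))  # both equal 1.75
```

For a uniform table with the same four outcomes the entropy rises to 2 bits, and the same trial counts give a mean of 2.25, which stays above the bound.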


4.


Application of Information Theory to the Problem of
Information Transmission Through Communication
Channels


4.1. Basic concepts. Efficiency of a code 


In order to illustrate the usefulness of the concepts of entropy and information,
which were introduced in Chapter 2, we analyzed in Chapter 3 a series
of ‘recreative problems’ of the types that are usually popular among high school
and undergraduate mathematical enthusiasts. In the present chapter we consider
some of the simplest, but in their own right sufficiently serious, applications
of these very concepts to the important engineering problem of transmitting
information through communication channels. These applications are also
shown to have much in common with the already considered ‘recreative problems’
of finding a thought-of number by means of asking questions or a
counterfeit coin by means of weighing, so that many arguments from the preceding
chapter can be carried over directly to the solution of practical problems in
communication engineering.

The starting point is a general scheme for transmitting information through a
communication channel, for definiteness, say, through a telegraph line. At one
end of the line, the transmitter is fed some message, consisting of a series of
symbols selected, say, from a set of 27 characters of the English (i.e., Latin) alphabet
(26 letters and also a ‘zero character’, a space between words), or of 33 characters
of the modern Russian alphabet (also including a space), or of 10 digits (in the
case of numerical information), or of all the characters and digits taken together.
For the transmission of this message in the case of an ordinary wire telegraph,
we use current, some characteristics of which the telegraphist can change at his
discretion. This enables him to set up a definite sequence of signals which are
discernible by the other telegraphist at the receiving end. The simplest distinguishable
signals that are extensively used in practice are those of switching on the
current pulse (i.e., switching on the current for some well-defined time) and
cutting off the current, thus creating a pause (the cut-off of current for the same
time). Any message can be transmitted by means of these two signals, if every
character or digit is agreed upon to be replaced by a definite combination of
current pulses and noncurrent pauses.

In communication engineering, the rule that associates some combination of
signals with each message to be transmitted is usually called a code (in the case
of a telegraph, for example, a telegraphic code) and the operation of translating
a message into a sequence of distinct signals is called the coding of the message. A
code using only two distinct elementary signals (such as a current pulse and a
noncurrent pause) is called a binary code, a code using three distinct elementary
signals a ternary code, and so on. In telegraphy, in particular, a number of
distinct codes are used, the most noteworthy of which are the Morse code (Morse
alphabet) and the Baudot code. In the Morse code, we associate with every letter
or digit of the message some sequence of short-duration current pulses (dots) and
three times longer current pulses (dashes), separated by short-duration pauses of
the same length as a ‘dot.’ Moreover, the gap between the letters (or digits) is
recorded by a special separation mark, a long pause (of the same length as a
‘dash’), and the gap between the words by a pause that is twice as long as that
between individual characters. Although this code uses only current pulses and
pauses, it can be regarded as a ternary code because every encoded piece of
information in this case naturally decomposes into a collection of the following
three relatively large ‘elementary signals’: dots, to each of which (within the
letter or digit encoded) is invariably added a dot-length pause; dashes, each followed
by a short-duration (dot-length) pause; and lengthy (dash-length) pauses
that separate the individual characters. Morse code is at present used
only when the basic telegraphic channels are damaged, and also in short-wave
radio-telegraph, which finds many important applications.

The binary Baudot code is mostly used in the ordinary letter-printing telegraphic
apparatus installed at all the modern telegraphic offices. This code associates
with every character some sequence of five elementary signals consisting
of current pulses and pauses of the same length. Since all the characters are
transmitted here by a combination of signals of the same length (codes having
this property are called uniform codes), no special mark is necessary in the
Baudot code for separating one character from another, because it is known
beforehand at the receiving end that after every five elementary signals one
character terminates and the succeeding one starts (in the receiving apparatus
such partition of the sequence of signals into combinations of five signals is
usually carried out automatically). Since by combining the two possibilities of
the first signal with the two possibilities of the second, two of the third, two of
the fourth, and two of the fifth, we can compose altogether $2^5 = 32$ different
combinations, the Baudot code in its simplest form allows us to transmit 32
distinct characters.†


†Since 32 combinations are found to be inadequate for the transmission of all characters and
digits, in apparatus using the Baudot code there are two registers; after switching the
register the same combination is used for the transmission of one more character. The number
of possibilities is thus almost doubled, which enables us to transmit all letters, digits and
punctuation marks. In the case of a single register such possibilities are, however, available in
a code that associates with every letter or digit a combination of six elementary signals; such
codes are also used sometimes in telegraphy.

In certain types of telegraphic apparatus, besides simply switching the current on and off,
it is also possible to reverse the current direction. This affords an opportunity
to discard current pulses and pauses and instead use as basic signals the current
pulses in two different directions, or even to use simultaneously three distinct elementary
signals of the same length: the current pulse in one direction, the
current pulse in the other direction, and a pause. Still more complex types of
telegraphic apparatus are also possible, in which the pulses are differentiated
not only by the direction but also by the amplitude of the current. This gives
us an opportunity to further enlarge the number of distinct elementary signals.
An increase in the number of such signals allows us to make a code more compact
(i.e., to decrease the number of elementary signals required for the transmission
of the given information or to transmit by means of signals of the same
length a significantly larger number of different ‘characters’). However, at the
same time it complicates the transmission system and makes it costlier, so that in
practice it is always preferable to use a code with a smaller number of elementary
signals.

In a radio-telegraph, in place of the current amplitude, some parameters of a radio
wave (sinusoidal oscillations of high frequency) are varied, i.e., the elementary
signals here have a different sense. However, in this case also every character
to be transmitted is replaced by some sequence of elementary signals that are
discernible at the receiving end of the channel. A similar situation holds also
in the majority of other communication channels. This is discussed in greater
depth in Secs. 4.3 and 4.4.

We now dispense with engineering details and formulate the fundamental
mathematical problem we have to deal with in communication engineering.
Suppose that there is a message written by means of some ‘alphabet’ containing
$n$ ‘characters’ (say, 27 English characters, or 33 Russian characters, or 10 digits,
or all the characters and digits, or characters, digits and punctuation marks and
so on). It is required to ‘encode’ this message, i.e., to indicate a rule which
would associate with every such message a definite sequence of the $m$ different
‘elementary signals’ that make up the ‘alphabet’ for transmission. How can this
be done most advantageously?

In the first place, we must clarify in what sense the term ‘advantageous’ is
understood here. We consider a coding to be the more advantageous the fewer
the elementary signals that have to be used for the transmission of a message.
If it is assumed that each elementary signal takes up the same time, then the
most advantageous code is that which allows us to spend the least time in
message transmission. Since the installation and maintenance of communication
channels are usually very expensive (and in the case of radio communication,
where a slightly different position holds, an indiscriminate increase in the number
of communication channels is impossible because this can give rise to interference
between adjacent channels), it is surely of great importance to move on to
a more advantageous code that enables us to use a given communication channel
more efficiently.

We shall now try to analyze in somewhat more detail the sorts of codes that are
generally used. For definiteness we shall assume for the time being that $m = 2$
(i.e., our code is binary). In addition, we restrict ourselves only to the case of
one-letter coding, i.e., to the case of codes that are suitable for transmitting each
individual letter of a message (we shall speak later about the opportunities opened
up by the rejection of this last restriction). In such a case the coding obviously
is such that each of the $n$ ‘letters’ of our ‘alphabet’ is assigned some
sequence of the two elementary signals, called the code word associated with the
corresponding ‘letter.’ If we choose to ignore the physical nature of the elementary
signals to be used, we can replace them by the digits 0 and 1, i.e.,
consider all code words as sequences of these two digits. For assigning a
code it is necessary to enumerate $n$ such sequences to be associated with the $n$ existing
‘letters.’ However, not every collection of $n$ distinct sequences of the digits 0 and 1 is
suitable for practical use in a binary code; it is still necessary to ensure that the
encoded information is uniquely decipherable, i.e., in a long sequence of digits
0 and 1 assigned to a multiletter message, it should always be possible to understand
where the code word of one letter ends and that of the succeeding letter
starts. It is quite simple to achieve this if, as in Morse code, a special separating
symbol is introduced (in the engineering literature, such a symbol is sometimes
called a ‘comma’), which is distinct from all other code words and easy to
distinguish, and this symbol is transmitted between the code words of each two
‘letters.’ It is, however, plain that this method can hardly be advantageous,
since here the number of ‘letters’ in the message to be transmitted is almost doubled
(due to the addition of the $(n + 1)$th separating ‘letter’ inserted between
every two other letters). Hence in the following we shall be interested only in
uniquely decipherable codes without a separating symbol (i.e., ‘codes without a
comma’). Examples of such codes are, in particular, those codes in which the
code words of all letters have the same length (i.e., uniform codes; see the foregoing
description of the Baudot code). In addition, there are also many nonuniform
codes (containing code words of different lengths) that are uniquely
decipherable and hence do not require a separating symbol. Thus, for example,
in the case of a two-letter alphabet (in which $n = 2$) the simplest code without
a comma is the uniform code with the code words 0 and 1; however, if we replace
the code word 1 by a collection of two digits 11, or 10, or 01 (but, obviously,
not by 00), then such a code is also easy to decipher uniquely (in
all these cases the code words of the second letter are easily identified in any
long sequence of code words of both types by the digit 1 appearing in them).

A more general necessary and sufficient condition that distinguishes uniquely decipherable
codes among all other collections of $n$ sequences of the digits 0 and 1
can be found in [65] (in this connection see also [64], which deals with the general
theory of binary nonuniform codes). For our purpose here it is, however, adequate
to remark only that a nonuniform code is surely uniquely decipherable
if no code word is a prefix of any other longer code word (so that, for example,
if ‘101’ is the code word of some letter, then there cannot be a letter having the
code word ‘1’, ‘10’, or ‘10110’). In fact, if this condition is satisfied, then by
reading consecutively the coded script of a message and having before us a list
of all code words, it is always possible to tell exactly at what place the code
word for one letter ends and that for the succeeding one starts (since here the
sequence of elementary signals that starts after the termination of the preceding
code word itself forms a code word only if we cut it off strictly at one definite
place).† We further note that a uniform code also obviously satisfies the condition
set forth above. As a rule, we shall not consider below codes
that do not satisfy this condition. Hence from now on, unless we say otherwise,
by a ‘code’ we shall mean a collection of $n$ code words associated with $n$ characters
of an alphabet for which the condition indicated above is satisfied.
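The prefix condition and the resulting left-to-right decoding can be sketched in a few lines. The function names and the four-letter sample code below are illustrative assumptions, not taken from the text.

```python
def is_prefix_free(code_words):
    """True if no code word is a prefix of (or equal to) another code word."""
    words = list(code_words)
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j and b.startswith(a):
                return False
    return True

def decode(bits, code):
    """Decode a string of digits 0 and 1; `code` maps letters to code words.

    Assumes the code satisfies the prefix condition, so each code word
    is recognized as soon as its last digit has been read.
    """
    inverse = {word: letter for letter, word in code.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            letters.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing digits do not form a code word")
    return letters

sample_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(is_prefix_free(sample_code.values()))   # True
print(decode("010110111", sample_code))       # ['A', 'B', 'C', 'D']
print(is_prefix_free(["101", "10110"]))       # False
```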

Let us now take up the question of the relationship of binary coding to Problem 25
on finding a thought-of number, which does not exceed $n$, by means of
questions that can be answered ‘yes’ or ‘no.’ This relationship is most straightforward.
In fact, suppose that we have some binary code; it is convenient to
consider that the ‘characters’ associated with our code words are all possible
numbers from 1 to $n$. Let us further suppose that it is required to find a thought-of
number, which does not exceed $n$. Then we may ask in the first place the
question “Is the first numeral of the code word of the number thought of equal
to 1?” By way of the second question we may ask “Is the second numeral of
this code word equal to 1?” and so on. We thus consecutively determine all
numerals of the code word of the number thought of: since none of these words
is a prefix of another, as soon as we arrive at a combination of numerals that
makes up a code word, we can ascertain the number thought of with complete
certainty and announce it. Thus, every binary code for an $n$-letter alphabet corresponds
to some method of finding out one of $n$ numbers thought of through ‘yes’
or ‘no’ questions. Conversely, any method of finding a thought-of number
allows us to associate with each of the numbers a sequence of numerals 1 and
0, where the first numeral shows whether, in the case in which a given number
is thought of, the first question is answered ‘yes’ or ‘no’; the second numeral in
exactly the same way indicates the answer to the second question, the third


†A code that satisfies the stated condition is often called an instantaneous (or instantaneously
decipherable) code. This term is due to the fact that, in the case of other uniquely
decipherable codes, to determine that we have come to the end of the current code word we
sometimes (or even always) have to acquaint ourselves with several succeeding elementary
signals, too (that is, the decoding is effected with a lag in comparison to the transmission of
information). In the foregoing three examples of nonuniform codes for a two-character alphabet,
with the code words 0 and 11, or 0 and 10, or 0 and 01, the first two are obviously
instantaneous codes but the third one is not (in the third case, to clarify the meaning of the digit
0 in a long sequence of digits 0 and 1 that forms the encoded message, it is necessary to know
also the succeeding digit).




numeral to the third question, and so on. Hence, any method of finding a thought-of
number leads to a binary code. The condition formulated above is obviously
always satisfied here because from the fact that our method allows us to indicate
uniquely the number thought of through answers to the questions asked, it
directly follows that none of the code words obtained can emerge as a continuation
of another code word. For example, the presence of the sequence ‘101’ among
the code words implies that the answers ‘yes’, ‘no’ and ‘yes’ already completely
determine the number, and this eliminates the possibility of the existence of the code
word ‘10110.’

It is thus seen that the possible binary codes for an $n$-letter alphabet precisely
correspond to all possible methods of determining one of $n$ numbers thought
of by means of questions answerable by ‘yes’ or ‘no.’ It is now not difficult to understand
which code is most advantageous. We shall for the present measure
the advantage (or, more aptly, the efficiency) of a given binary code in terms of
the maximum number of elementary signals (equivalently, the digits 1 and 0)
that are required for the transmission (or writing) of a single character: the smaller
the maximum number, the more efficient is the code. (A more precise definition
of the ‘efficiency’ of a code is derived from the calculation of the average number
of elementary signals corresponding to one character; this definition will be considered
in the next section.) In such a case, the problem of constructing the most
efficient code coincides with the content of Problem 25. According to the solution
of this problem, the greatest number $k$ of elementary signals that make up
a character cannot be less than $\log n$, i.e., at best it is defined by the inequalities
(1) on p. 106. The necessity of the inequality $k \ge \log n$ is implied by
elementary computations of information. In fact, one letter of an $n$-letter alphabet
can contain $\log n$ bits of information (for this, it is necessary only that all
‘letters’ of the message be independent of each other and each of them can take
all values with the same probability). On the other hand, every elementary signal
to be transmitted that takes either of two values (these being, say, either
a current pulse or a pause) cannot contain more than 1 bit of information.
Hence, for the transmission of one character not less than $\log n$ elementary signals
are needed.

For constructing the most efficient binary code we can make use of the solution
of Problem 25. Namely, we partition our ‘characters’ into two groups as
close to being numerically equal as possible; for all characters of the first group
we take 1 as the first numeral of the code word and 0 as the first numeral for all
characters of the second group. Furthermore, each of these two groups is again
partitioned into two groups as nearly numerically equal as possible, and we take 1 as the
second numeral of the code word if the corresponding character is contained in
the first of these two smaller groups, and 0 if it is contained in the
second of these groups. Then we partition each of the four already existing
groups into two still smaller groups that are numerically equal as closely as is
possible and, as before, we choose the third numeral of the code word, and so on.
By what has been stated in Chap. 3.1, we thus arrive at a binary code for which
the maximal number of numerals $k$ in one code word is defined by inequalities
(1) on p. 106, so that no code can be more efficient than this one.
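The construction by repeated halving lends itself to a short recursive sketch. The function name `partition_code` is our own, and the choice of giving the first group the extra character when a group has odd size is an assumption; the text only asks that the groups be as nearly equal as possible.

```python
def partition_code(symbols):
    """Assign code words by repeatedly splitting the characters into two
    groups that are as nearly equal in size as possible.

    The first group receives the numeral 1, the second the numeral 0.
    """
    symbols = list(symbols)
    if not symbols:
        return {}
    if len(symbols) == 1:
        return {symbols[0]: ""}          # a single character needs no further numerals
    half = (len(symbols) + 1) // 2       # first group takes the extra character, if any
    code = {}
    for numeral, group in (("1", symbols[:half]), ("0", symbols[half:])):
        for symbol, tail in partition_code(group).items():
            code[symbol] = numeral + tail
    return code

code = partition_code(range(1, 11))       # the numbers 1, 2, ..., 10
for number, word in code.items():
    print(number, word)
print(max(len(word) for word in code.values()))   # 4
```

For ten characters the longest code word has four numerals, in agreement with the bound discussed above; the code obtained is in general nonuniform.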

This obviously does not imply that there is no other code as efficient, i.e., that
there can be only one most efficient code. In particular, it is clear that if we
estimate the efficiency of a code consisting of the digits 0 and 1 by the longest
code word, then we need not consider nonuniform codes at all. In fact, if the
code is nonuniform, then we may add to the end of any code word whose length
is less than the maximum a certain number of arbitrarily chosen digits (say, only
digits 0) to arrive at a uniform code that has the same maximum length of code
word as the original nonuniform code. This deduction is vital for applications
since uniform codes have an apparent practical advantage: they are considerably
simpler to decode and here the decoding can easily be automated. We note
furthermore that there may be several different uniform codes with the minimum
possible length of code words. In view of their great practical importance,
we describe here just one more method of constructing such codes, which
is in essence quite similar to the one described.

This method involves the use of the binary number system. Ordinarily, we use
the decimal number system, in which every number is presented as a sum of
powers of the number 10:

$$n = a_k \times 10^k + a_{k-1} \times 10^{k-1} + \cdots + a_1 \times 10 + a_0,$$

where $a_k, a_{k-1}, \ldots, a_1, a_0$ are the digits of the number, which can take values
from 0 to 9; the number $n$ is denoted here by the sequence of its digits, i.e., as
$a_k a_{k-1} \ldots a_1 a_0$. In analogy to this, the number $n$ can also be represented as
a sum of powers of the number 2:

$$n = b_l \times 2^l + b_{l-1} \times 2^{l-1} + \cdots + b_1 \times 2 + b_0;$$

here the ‘digits’ $b_l, b_{l-1}, \ldots, b_1, b_0$ must be less than 2, i.e., they can take only
the values 1 and 0. In the binary number system the number is denoted by the
sequence of appropriate ‘binary digits’; thus, for example, since

$$6 = 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0, \qquad 9 = 1 \times 2^3 + 0 \times 2^2 + 0 \times 2^1 + 1 \times 2^0,$$

in the binary number system the numbers 6 and 9 are written as 110
and 1001, respectively. Numbers can obviously be represented also as
sums of powers of any other number $m$; we thus arrive at an $m$-ary number
system, where the ‘digits’ can take the $m$ values $0, 1, 2, \ldots, m - 1$ (such a
system will be needed by us later on).

The number $k$ of digits in the usual (‘decimal’) notation of the number $n$ is
obviously defined by the inequalities

$$10^{k-1} \le n < 10^k;$$

thus, a number in the interval between $10^1 = 10$ and $10^2 - 1 = 99$ has two digits,
that between $10^2 = 100$ and $10^3 - 1 = 999$ has three digits, and so on. In
analogy to this, the number $k$ of ‘digits’ in the binary notation of the number $n$ is
defined by the inequalities

$$2^{k-1} \le n < 2^k.$$


(Hence it follows directly, in particular, that the number 6 has three digits and
the number 9 has four digits in the binary number system.) Therefore, if we
write the first $n$ integers starting from 0 (i.e., $0, 1, 2, \ldots, n - 1$), then it is
found that with $2^{k-1} < n \le 2^k$ the binary notation of all these numbers will contain
not more than $k$ symbols, and exactly $k$ symbols are required at least once.
If we now add a definite number of zeros to the beginning of
the binary notation of all numbers with fewer than $k$ digits, we arrive at a uniform
binary code for an $n$-letter alphabet with minimum possible length of code words.
Thus, when $n = 10$, say, the corresponding code words are the following combinations
that represent the expression in the binary number system of all numbers
from 0 to 9 and are supplemented, if necessary, by zeros at the beginning up to
four symbols: 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001. All
code words for any other $n$ are equally simple to construct by this method;
no preliminary partitioning of a collection of $n$ numbers into smaller groups is
involved here.†
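The construction through the binary number system is easy to mechanize. A minimal sketch; the function name `uniform_binary_code` is an illustrative choice, and for the degenerate case of a single character we assume a code word of length 1.

```python
def uniform_binary_code(n):
    """Code words for the numbers 0, 1, ..., n - 1: their binary notation,
    padded with zeros at the beginning to the common length k,
    where k is the smallest integer with n <= 2**k.
    """
    if n < 1:
        raise ValueError("the alphabet must contain at least one character")
    k = max(1, (n - 1).bit_length())     # number of binary digits of n - 1
    return [format(i, "0{}b".format(k)) for i in range(n)]

print(uniform_binary_code(10))
# ['0000', '0001', '0010', '0011', '0100', '0101', '0110', '0111', '1000', '1001']
```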


We have shown that in the case of an $n$-letter alphabet the length of code words
(i.e., the number of elementary signals contained in them) for the most efficient
uniform binary code is the smallest integer $k$ satisfying the inequality $k \ge \log n$.
We now note that if $\log n$ is not an integer, then code words of such a length
can be used, in general, for the transmission of a greater amount of information
than that really transmitted in the case of coding a message written by means of
an $n$-letter alphabet. Consider, for instance, the case $n = 10$ (let us say the
case of the transmission of numerical information). Every digit of the information
being transmitted (written in the usual decimal number system) can take
one of ten values, i.e., can contain information at most equal to $\log 10 \approx 3\frac{1}{3}$ bits,
which is attained in the case in which all digits of the message are independent
of each other and each of them can take all values with the same probability.
Every digit of the encoded message (i.e., every elementary signal being transmitted,
say a current pulse or pause) can take either of two values, i.e., can
contain information at most equal to 1 bit (abridged from the words binary unit).
But the use of a uniform binary code involves sending four elementary signals for

†It is easy to see that when $n$ is an integral power of 2 (say, $n = 8$, $n = 16$, or $n = 32$),
the code obtained with the aid of the binary number system coincides exactly with the one given
in the solution of Problem 25. (When $n = 10$, the ‘binary code’ reduces to the solution of
Problem 25 if the solution begins with the question “Does the number thought of exceed 8?”;
see p. 129.)

the transmission of one-digit information, and $4M$ elementary signals for the
transmission of $M$-digit information. However, by means of $4M$ binary signals
we can transmit information equal to $4M$ bits, i.e., information approximately
$\frac{2}{3}M$ bits greater than the maximum information which can be contained in
an $M$-digit number, i.e., equal to $M$ decimal units of information.

This phenomenon is straightforward to explain. The reason is that when
$n = 10$ the symbols in an encoded message can never all be mutually independent and
take both possible values with the same probability: these conditions can be satisfied
only when $n = 2^k$. In particular, if we use the code constructed with the
aid of the expansion of numbers from 0 to 9 in the binary number system, then
in the case in which all digits in the original message are encountered with the same
frequency, the digit 0 in the encoded message is encountered $\frac{25}{15} = \frac{5}{3}$ times as
frequently as the digit 1 (since it is easily verifiable that in the ten code words
written on p. 144 the digit 0 is encountered 25 times and the digit 1 only 15
times). However, for a sequence of a given number of digits 0 and 1 to contain
the largest amount of information, it is necessary that all digits of this sequence
take both values with the same probability (and be mutually independent).
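The count of the digits 0 and 1 in the ten four-digit code words can be verified directly. A small sketch; the helper name `digit_counts` is ours.

```python
from fractions import Fraction

def digit_counts(words):
    """Total numbers of digits 0 and digits 1 in a list of code words."""
    zeros = sum(word.count("0") for word in words)
    ones = sum(word.count("1") for word in words)
    return zeros, ones

# The numbers 0 to 9 in binary notation, padded with zeros to four symbols.
words = [format(i, "04b") for i in range(10)]
zeros, ones = digit_counts(words)
print(zeros, ones, Fraction(zeros, ones))   # 25 15 5/3
```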

For the transmission of a long numerical message it is, however, also possible
to construct a more advantageous binary code. This necessitates only that we
give up letter-wise coding (by the ‘letters’ of which our message consists we of course
mean the digits 0, 1, ..., 9) and use instead the so-called block codes, in which
code words are associated with ‘blocks’ consisting of a fixed number of sequential
‘letters’. We start with the simplest block of two ‘letters’, i.e., partition our
message into sequential pairs of digits† and convert into the binary number system
not every digit individually but each ‘two-digit’ number obtained under such
partitioning. The number of binary symbols required for writing all two-digit
numbers (from 00 to 99 inclusive) is equal to the number of questions needed
for finding a thought-of number within the first hundred, i.e., it equals 7 (see
Problem 25, p. 106). Thus, such a system of coding involves for two digits of the
message an outlay of 7 elementary signals (not $2 \times 4 = 8$, as earlier), i.e., for
the transmission of a number containing $M$ digits (for the sake of simplicity $M$
is assumed to be even) it is necessary to send $3.5M$ elementary signals, or $0.5M$
signals fewer than in the original system of coding. When it is required to
transmit many digits (in the case of $M$ being large) the advantage is found to be
quite appreciable.

†Such a partition of a message into sequential pairs of digits is obviously
equivalent to its conversion into a hundred-ary number system.

It is even more advantageous to partition the number to be transmitted into
blocks of three digits and to convert each ‘three-digit’ number obtained in this
way into the binary number system. For the transmission of a ‘three-digit’
number it is obviously necessary to send 10 elementary signals (see p. 106), so
that such a method of coding allows us to transmit a number consisting of M
digits (in case M is a multiple of three) by means of (10/3)M = 3⅓M elementary
signals. The advantage that can be had from partitioning a message into still
larger blocks and converting each such block individually into the binary system
is quite small in practice (in the passage from a block of three digits to a block
of four digits the coding efficiency even decreases: the transmission of four
digits, as is easy to see, involves 14 = 3.5 × 4 elementary signals). Nevertheless,
it is interesting to note that by applying the partitioning into sufficiently large
blocks we can further ‘condense’ our code and make the ratio of the number of
elementary signals in the encoded message to the number of digits in the original
(usually decimal) number arbitrarily close to the limit value log 10 = 3.32193...
In fact, by partitioning into blocks of N digits, say, we arrive at a code in which
every N digits of the message involve k elementary signals, where k is an integer
satisfying the inequalities

$$k - 1 < \log 10^N \le k,$$

or, equivalently,

$$N \log 10 \le k < N \log 10 + 1.$$

Hence, it is seen that in such a code the average number k/N of elementary
signals per decimal digit cannot differ from the quantity log 10 by more than
1/N; if we choose N sufficiently large, we can make this difference arbitrarily
small (see p. 106).
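
The bound can be illustrated numerically. The sketch below (the function name `signals_per_block` is an assumption made for illustration) finds, with exact integer arithmetic, the smallest k for which $2^k \ge 10^N$ and compares k/N with log 10 taken to base 2.

```python
from math import log2


def signals_per_block(block_len):
    """Smallest k with 2**k >= 10**block_len (exact integer arithmetic)."""
    return (10 ** block_len - 1).bit_length()


# N = 1, 2, 3, 4 give k = 4, 7, 10, 14, as in the text.
for n_digits in (1, 2, 3, 4, 10, 100):
    k = signals_per_block(n_digits)
    excess = k / n_digits - log2(10)   # always between 0 and 1/N
    print(n_digits, k, round(k / n_digits, 4), round(excess, 4))
```

The excess over log 10 is not monotone in N (it rises from N = 3 to N = 4), but it never exceeds 1/N.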

Clearly, in the foregoing reasoning almost nothing is changed if the original
message is not numerical but consists of ‘letters’ of an arbitrary n-letter ‘alphabet’
(for example, ordinary English letters, or Russian letters, or letters and digits,
or letters, digits and punctuation marks, and so on). In this case, it is also
reasonable to use the coding of big blocks of N such ‘letters’; for such a coding
it is necessary only to expand the first $n^N$ numbers into the binary system. This
method makes it possible to achieve the result that the average number of
elementary signals necessary for one letter of the message is arbitrarily close to
the quantity log n (a simple calculation of the amount of information shows that
our average number can never be less than this quantity). It is only in the case in
which n is an integral power of 2 ($2^k$, say) that such partitioning into big blocks
is found unnecessary; a code can then be made optimal by associating each
individual letter with some code word, so that recourse to block coding confers
no advantage. In this context we remark that in a certain sense ‘block coding’ is
always less convenient than ‘coding by individual letters’: in block coding the
decoding is naturally found to be more complex and laborious (the longer the
code, the more this is so) and, moreover, it always entails a delay in decoding
(it is not possible to determine the first letter transmitted until the succeeding
$N - 1$ letters have been transmitted as well).

All the arguments we have adduced easily carry over also to the case in which
for transmission we make use of not 2 but m elementary signals (the case of an
m-ary code). For constructing a most efficient uniform code the only requirement
here is that we use not a binary but an m-ary number system. If n is an
integral power of m, then the coding can be restricted to each letter of the
message individually; the number of elementary signals required for the
transmission of one letter can also be made here to assume the least possible value,
namely the value log n/log m. However, if n is not an integral power of m, then
by assigning each letter of the message individually to a code word we have to
send k > log n/log m elementary signals for every letter; here k is the least integer
greater than log n/log m. In this case we can construct a more efficient code by
using N-letter block coding; if we choose N sufficiently large, we can ensure
that the average number of elementary signals required for the transmission of one
letter of a message is arbitrarily close to log n/log m. In the particular case m = 3,
the corresponding arguments will be similar to those used in Chap. 3.2 for
the determination of the number of weighings on a beam balance required to find
a counterfeit coin (see pp. 108-109). In fact, since each weighing can have three
outcomes, the result of a sequence of such weighings can be represented in the
form of a sequence of digits, each of which takes one of three values†, i.e.,
in the form of some number written in the ternary system.
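
As a rough illustration of the general case, the sketch below computes the cost per letter of a uniform block code for an n-letter alphabet and m elementary signals. The function name and the sample alphabets are assumptions chosen for illustration only.

```python
from math import log


def signals_per_letter(n, m, block_len=1):
    """Signals per letter when N-letter blocks get m-ary code words of equal length."""
    blocks = n ** block_len        # number of distinct N-letter blocks
    k, capacity = 0, 1             # capacity = m**k distinct words of k signals
    while capacity < blocks:
        capacity *= m
        k += 1
    return k / block_len


# n = 27 is a power of m = 3, so letter-by-letter coding is already optimal.
print(signals_per_letter(27, 3))                  # 3.0
# n = 10, m = 3: blocks bring the cost down towards log 10 / log 3.
for block_len in (1, 2, 5, 20):
    print(block_len, signals_per_letter(10, 3, block_len), log(10) / log(3))
```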


4.2. Shannon-Fano and Huffman codes. Fundamental coding theorem 


The basic results of the preceding section can be stated as follows: if the
number of letters in an ‘alphabet’ is n, and the number of elementary signals being
used is m, then in any coding method the average number of elementary signals
required per alphabet letter cannot be less than log n/log m; however, it can always
be made arbitrarily close to this ratio, provided we associate individual code words
directly with sufficiently long ‘blocks’ consisting of a large number of letters. From
the conceptual viewpoint, this result is obviously linked to the simple arguments
stated by Hartley in 1928. It is in no way related to any probabilistic
considerations (in Sec. 4.1 the term ‘probability’ was not mentioned at all)
and actually rests only on an elementary calculation of the number of ‘distinct
N-letter sequences of an n-letter alphabet’ and ‘distinct sequences of $N_1$
elementary signals’. Hence the results of Section 4.1 can hardly claim to establish
the importance of information theory for the engineering problem of transmitting
messages, of which we spoke in the Preface to the present book.

†Since a ternary system is used here, these values can be denoted by the digits
0, 1, and 2, but alternatively the letters E, R, and L can also be used (see
Chap. 3.2).

The results of Sec. 4.1 can indeed be considerably improved if we make use of
the concept of entropy, which we introduced in Chap. 2, and take note of the
statistical properties of actual messages. As a matter of fact, in Sec. 4.1 we
characterized quite roughly the efficiency of a code as the greatest number of
elementary signals involved per letter of the message to be coded, and for this
we considered only the simplest codes, i.e., uniform codes. If at the end of that
section we talked also of the average number of signals involved per letter of
message, this was connected only with the fact that there the uniform codes were
considered for multiletter blocks and the ratio of the number of elementary
signals in a code word to the number of letters in the corresponding block (which
we called the average number of elementary signals per letter) need not be an
integer. But, in practice, we usually have to deal with messages in which the
relative frequencies of different letters differ considerably from each other (it
suffices to compare, say, the frequencies of the letters e and y in any English text;
we shall elaborate on this in Chap. 4.3). Hence the key quantity here must be
the probabilistic mean (or average) value of the number of elementary signals
involved per letter of message, which is defined in accordance with the actual
statistical laws characterizing the message to be transmitted.

Let us now examine the problem of the coding of messages that obey definite
statistical laws. We consider here only the simplest case of messages written by
means of some n ‘letters’, the frequencies of whose occurrence at any point in
the message are completely characterized by the probabilities $p_1, p_2, \ldots, p_n$,
where, obviously, $p_1 + p_2 + \cdots + p_n = 1$. The simplification which we use
here is that the probability $p_i$ of the occurrence of the ith letter at any point in
the message is assumed to be one and the same irrespective of what letters occur
at preceding points; in other words, the successive letters of the message are
considered to be independent of each other. In actual messages this is often
not the case; in particular, in the English language the probability
of the occurrence of a letter depends essentially on the preceding letter (see
p. 181 below et seq.). However, if we were to take into rigorous account the
mutual dependence between letters, it would greatly complicate all our further
discussions; at the same time, it is natural to think that this should not alter the
results deduced below† since, if desired, by ‘letters’ we can straightaway
understand multiletter blocks whose dependence on each other is already
comparatively weak.

†It can indeed be shown that all the results presented below are preserved for a
very wide class of cases in which the successive letters of a message are dependent
on each other (see pp. 161-62 below).

We shall consider for the present only binary codes; an extension of the
corresponding results to codes that utilize an arbitrary number m of elementary
signals is as usual quite simple. A brief discussion at the end of the section will
suffice for this purpose. We start with the simplest case of codes that associate
every ‘letter’ of a message with an individual code word, a sequence of digits 0
and 1. It has already been remarked above that some method of finding a
thought-of number x not exceeding n by means of ‘yes-or-no’ questions can be
associated with every binary code of an n-letter alphabet; conversely, any method
of determining such a number leads us to a definite binary code. When
probabilities $p_1, p_2, \ldots, p_n$ are assigned to individual letters, the transmission of
a multiletter message corresponds precisely to the situation described on p. 131
et seq.: the optimal code in such a case is associated with the method for finding
a number x for which, with the same n probabilities of the values of x, the
average number of questions asked is found to be the least. This average value can
itself be considered also as the average value of the number of binary symbols
(digits 0 and 1) in a code word; in other words, it precisely equals the average
value of the number of elementary signals per letter in the transmission of a
multiletter message.

It is now possible to apply directly to our problem the results set forth on p. 131
et seq. According to these results, in the first place the average number of binary
elementary signals per letter of the original message in the encoded communication
cannot be less than H, where $H = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$ is
the entropy of the experiment which consists of distinguishing one letter of
the text (or briefly, the entropy of one letter). This directly implies that for any
coding method for writing a long message of M letters we require not less than
MH binary symbols. This statement is immediate from the fact that the
information contained in a piece of text of M letters is equal to MH in our case (recall
that the individual letters are considered to be mutually independent); at the
same time the information contained in one elementary signal (binary symbol)
cannot exceed one bit in any way (see p. 132; a variant derivation of the same
result is given in small print on pp. 134-36).
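
The lower bound is simple to evaluate. The sketch below is illustrative only: the function name `entropy` and the message length M = 1000 are assumptions, and the probabilities are those of the six-letter example used later in this section.

```python
from math import log2


def entropy(probs):
    """Entropy in bits of one letter; zero-probability letters contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


probs = [0.4, 0.2, 0.2, 0.1, 0.05, 0.05]   # six-letter example from the text
H = entropy(probs)
M = 1000                                   # assumed message length, for illustration
print(round(H, 2))                         # 2.22
print(M * H, M * log2(len(probs)))         # bound M*H against M*log n for equal probabilities
```

For equal probabilities the two quantities coincide; any inequality among the probabilities makes M H strictly smaller than M log n.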

If the probabilities $p_1, p_2, \ldots, p_n$ are not all equal among themselves, then
$H < \log n$. Hence, it is natural to think that by taking account of the
statistical laws of a message we can construct a code more efficient than the best
uniform code which, according to the results of Sec. 4.1, involves not less than
M log n binary symbols for writing a text of M letters. The procedure used to
obtain an optimal code is clear from what was stated on pp. 131-32. It is
convenient if, to start with, we arrange all existing n letters in one column in order
of decreasing probability. Then, all these letters should be divided into two
groups of higher and lower probabilities, so that the total probabilities of the
letters of the message belonging to either of these groups should be as close as
possible to each other; for letters of the first group we use 1 as the first digit of
the code word, and for those of the second group the digit 0. Furthermore,
each of the two groups obtained should again be divided into two parts with
total probabilities as close as possible to each other; we use either 1 or 0 as the
second digit of the code word according to whether our letter belongs to the
first or the second of these smaller groups. Then, each of the groups containing
more than one letter is again divided into two parts of closest possible total
probability, and so on: the process is repeated until we arrive at groups each of
which contains only one letter. Such a method of coding messages was first
suggested in 1948-1949 by R. Fano and C. E. Shannon independently of each
other; hence, the corresponding code is usually called the Shannon-Fano code
(sometimes simply the Fano code).† Thus, for example, if our alphabet contains
altogether six letters whose probabilities (in decreasing order) are 0.4, 0.2, 0.2, 0.1,
0.05 and 0.05, then in the first step of division of letters into groups we separate
only the first letter (first group), leaving all the rest in the second group.
Furthermore, the second letter forms the first subgroup of the second group;
however, the second subgroup of that group, consisting of the remaining four
letters, is again successively divided into parts such that every time the first part
consists of only one letter (see the accompanying table).


TABLE

No. of    Probability    Partition into subgroups            Code word
letter                   (the Roman numerals give the
                         number of the group or subgroup
                         at each successive division)
1         0.4            I                                   1
2         0.2            II  I                               01
3         0.2            II  II  I                           001
4         0.1            II  II  II  I                       0001
5         0.05           II  II  II  II  I                   00001
6         0.05           II  II  II  II  II                  00000


Similarly, the next table analyzes the case of a richer ‘alphabet’ containing 18
letters with probabilities 0.3, 0.2, 0.1 (2 letters), 0.05, 0.03 (5 letters), 0.02
(2 letters), and 0.01 (6 letters).

TABLE

No. of    Probability    Partition into groups               Code word
letter
1         0.3            I   I                               11
2         0.2            I   II                              10
3         0.1            II  I   I                           011
4         0.1            II  I   II  I                       0101
5         0.05           II  I   II  II                      0100
6         0.03           II  II  I   I   I                   00111
7         0.03           II  II  I   I   II                  00110
8         0.03           II  II  I   II  I                   00101
9         0.03           II  II  I   II  II                  00100
10        0.03           II  II  II  I   I                   00011
11        0.02           II  II  II  I   II  I               000101
12        0.02           II  II  II  I   II  II              000100
13        0.01           II  II  II  II  I   I               000011
14        0.01           II  II  II  II  I   II  I           0000101
15        0.01           II  II  II  II  I   II  II          0000100
16        0.01           II  II  II  II  II  I               000001
17        0.01           II  II  II  II  II  II  I           0000001
18        0.01           II  II  II  II  II  II  II          0000000


†To be more exact, this method of coding was in fact proposed by R. Fano alone;
C. E. Shannon, however, offered a slightly different method similar to the one
described above.

The basic principle of the Shannon-Fano coding method is the following: in
the choice of each digit of a code word we wish to ensure that the amount of
information contained in it be as large as possible, i.e., that independently of all
preceding digits this digit may take either the value 0 or 1 with almost equal
probability. The number of digits in different code words is obviously found
here to be different (in particular, it varies from one to five in the first example
and from two to seven in the second example), i.e., the Shannon-Fano code is
nonuniform. It is, however, easy to understand that no code word here can be
the prefix of another, longer code word (this is also clear from the fact that
such a code actually coincides with the method described on p. 131 et seq. for
solving the problem of finding a thought-of number; see pp. 141-42). Hence,
an encoded message is always uniquely decipherable. It is quite essential that
in the Shannon-Fano code we assign shorter code words to higher probability
letters than to low probability ones (because in the successive group divisions
higher probability letters are separated more speedily into individual
one-element groups; see the examples analyzed above). As a result, although
certain code words may also have quite a significant length here, the average
value of the length of such words is nevertheless found to be only slightly greater
than the minimal value H admissible for messages in order to preserve the
amount of information in coding. Thus, for the six-letter alphabet example
considered above, the best uniform code consists of three-digit code words
(because $2^2 < 6 < 2^3$), and hence in that case we assign exactly three elementary
signals to each letter of the original message; however, in using the Shannon-Fano
code the average value of elementary signals per letter of message is given
by


$$1 \times 0.4 + 2 \times 0.2 + 3 \times 0.2 + 4 \times 0.1 + 5 \times (0.05 + 0.05) = 2.3.$$


This value is appreciably less than 3 and is not very far away from the
corresponding entropy value

$$H = -0.4 \log 0.4 - 2 \times 0.2 \log 0.2 - 0.1 \log 0.1 - 2 \times 0.05 \log 0.05 = 2.22.$$


In analogy to this, for the 18-letter alphabet example considered, the best
uniform code consists of five-digit code words (since $2^4 < 18 < 2^5$); however, in
the case of the Shannon-Fano code there are letters that are coded by as many
as seven binary signals but, on the other hand, the average value of elementary
signals per letter is given by

$$2 \times 0.5 + 3 \times 0.1 + 4 \times 0.15 + 5 \times 0.15 + 6 \times 0.06 + 7 \times 0.04 = 3.29.$$

The preceding value is appreciably less than 5 and it does not deviate much
from the quantity

$$H = -0.3 \log 0.3 - 0.2 \log 0.2 - \cdots - 6 \times 0.01 \log 0.01 = 3.25.$$
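
The division procedure described above can be sketched in a few lines. This is only one possible reading of it, under two assumptions that the text does not spell out: probabilities are entered as whole hundredths so that all sums are exact, and when two divisions are equally balanced the one with the smaller first group is taken (this choice reproduces both tables above). The names `shannon_fano` and `average_length` are illustrative.

```python
def shannon_fano(weights):
    """Return Shannon-Fano code words; weights must be listed in decreasing order."""
    codes = [""] * len(weights)

    def split(lo, hi):
        # Letters lo .. hi-1 form one group; a single letter needs no more digits.
        if hi - lo <= 1:
            return
        total, running = sum(weights[lo:hi]), 0
        best_cut, best_diff = lo + 1, None
        for cut in range(lo + 1, hi):
            running += weights[cut - 1]
            diff = abs(total - 2 * running)   # |first part - second part|
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        for i in range(lo, hi):
            codes[i] += "1" if i < best_cut else "0"
        split(lo, best_cut)
        split(best_cut, hi)

    split(0, len(weights))
    return codes


def average_length(weights, codes):
    """Average number of binary signals per letter."""
    return sum(w * len(c) for w, c in zip(weights, codes)) / sum(weights)


six = [40, 20, 20, 10, 5, 5]                       # probabilities in hundredths
eighteen = [30, 20, 10, 10, 5] + [3] * 5 + [2] * 2 + [1] * 6

six_codes = shannon_fano(six)
eighteen_codes = shannon_fano(eighteen)
print(six_codes)                                   # ['1', '01', '001', '0001', '00001', '00000']
print(average_length(six, six_codes))              # 2.3
print(average_length(eighteen, eighteen_codes))    # 3.29
```

Working with integer weights avoids floating-point ties, which matter here because the very first division of the six-letter alphabet (0.4 against 0.6, or 0.6 against 0.4) is exactly balanced either way.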


A special advantage is derived from the Shannon-Fano method when it is used
for coding blocks of several letters and not the individual letters of an alphabet.
It is true that here it is nevertheless impossible to go below the limit value of
H binary symbols per letter of message (because, for the case in which the
individual letters are independent, the entropy of an N-letter block equals NH
and, consequently, in any coding method, there can occur on the average not
less than NH binary signals per block). However, even in comparatively
unfavourable cases block coding enables us to approach this minimal value rather
quickly. Consider, for example, the case in which there are only two different
letters A and B with probabilities p(A) = 0.7 and p(B) = 0.3. Then

$$H = -0.7 \log 0.7 - 0.3 \log 0.3 = 0.881\ldots$$


Here the application of the Shannon-Fano method to the original two-letter 
alphabet is in fact meaningless: it merely leads us to the following simplest uni- 
form code: 


Letter Probabilities Code words 
A 0.7 1 
B 0.3 0 


This code requires for the transmission of each letter one binary symbol, this 
being 13.5% more than the minimal attainable value 0.881 binary digits per 
letter. However, by applying the Shannon-Fano method to the coding of all 
possible two-letter combinations (whose probabilities are determined by the 
multiplication rule of probabilities for independent events, see p. 18), we arrive 
at the following code: 


Letter combination    Probabilities    Code words
AA                    0.49             1
AB                    0.21             01
BA                    0.21             001
BB                    0.09             000


The average value of the length of code words here is

$$1 \times 0.49 + 2 \times 0.21 + 3 \times 0.30 = 1.81.$$
Hence, in this case, we need on the average 1.81/2 = 0.905 binary symbols per
alphabet letter, which exceeds by only 3% the value 0.881 binary digits/letter.
Applying the Shannon-Fano method to the coding of three-letter combinations
leads us to the following code:


Letter combination    Probabilities    Code words
AAA                   0.343            11
AAB                   0.147            10
ABA                   0.147            011
BAA                   0.147            010
ABB                   0.063            0010
BAB                   0.063            0011
BBA                   0.063            0001
BBB                   0.027            0000


The average code-word length here is

$$2 \times 0.49 + 3 \times 0.294 + 4 \times 0.216 = 2.726,$$

i.e., on the average 0.909 binary symbols per letter of text are needed, which is
again only about 3% more than the limit value H = 0.881 binary digits/letter;
for such short blocks the approach to the limit need not be monotone.

When the difference in the probabilities of the letters A and B is still larger,
the approach to the minimal possible value of H binary digits/letter may
be somewhat less rapid, but it is nevertheless reasonable. Thus, when p(A) =
0.89 and p(B) = 0.11, the value of H is −0.89 log 0.89 − 0.11 log 0.11 = 0.5
binary digits/letter, while the uniform code A → 1, B → 0 (equivalent to the
application of the Shannon-Fano code to a set of two existing letters) involves
an outlay of one binary symbol for each letter and is twice as long. However,
it is easy to verify here that the application of the Shannon-Fano code to all
possible two-letter combinations leads to a code in which 0.66 binary digits on
the average are necessary per letter. The application of the same method to all
three-letter blocks allows one to lower the average number of binary digits per
letter to 0.55. Finally, the coding by the Shannon-Fano method of all possible
four-letter blocks involves on the average an outlay of 0.52 binary digits per
letter, i.e., only 4% more than the minimal value of 0.50 binary digits
per letter.
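
The figures for p(A) = 0.89 can be reproduced with a short computation. The sketch below is a hedged illustration: the function names are invented for the purpose, block probabilities are represented by exact integer weights $89^a \cdot 11^b$, and equally balanced divisions are resolved in favour of the smaller first group, which is an assumption not stated in the text.

```python
from itertools import product


def shannon_fano_lengths(weights):
    """Code-word lengths of a Shannon-Fano code; weights in decreasing order."""
    lengths = [0] * len(weights)

    def split(lo, hi):
        if hi - lo <= 1:
            return
        total, running = sum(weights[lo:hi]), 0
        best_cut, best_diff = lo + 1, None
        for cut in range(lo + 1, hi):
            running += weights[cut - 1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        for i in range(lo, hi):
            lengths[i] += 1            # every letter of the group gains one digit
        split(lo, best_cut)
        split(best_cut, hi)

    split(0, len(weights))
    return lengths


def digits_per_letter(a, b, block_len):
    """Average binary digits per letter for p(A) = a/(a+b), p(B) = b/(a+b)."""
    weights = []
    for block in product("AB", repeat=block_len):
        n_a = block.count("A")
        weights.append(a ** n_a * b ** (block_len - n_a))
    weights.sort(reverse=True)
    lengths = shannon_fano_lengths(weights)
    total = sum(w * k for w, k in zip(weights, lengths))
    return total / (sum(weights) * block_len)


for block_len in (1, 2, 3, 4):
    print(block_len, round(digits_per_letter(89, 11, block_len), 2))
```

Run as written, the loop gives 1.0, 0.66, 0.55 and 0.52 digits per letter for blocks of one to four letters.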

The Huffman code is closely related to the Shannon-Fano code, but it is the more
advantageous of the two (see [66]). We now proceed to describe this code. The
construction of this code rests on a simple transformation of the alphabet in
which the message to be transmitted over communication channels is written. This
transformation is called the contraction of the alphabet. Suppose that we have
an alphabet A containing the letters $a_1, a_2, \ldots, a_n$ whose probabilities of
occurrence in the message are $p_1, p_2, \ldots, p_n$, respectively; moreover, we consider
the letters to have been arranged in order of decreasing probability (or
frequency), i.e., we assume that

$$p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_{n-1} \ge p_n.$$


We now agree not to make a distinction between the two least probable letters of
our alphabet, i.e., we consider that $a_{n-1}$ and $a_n$ are one and the same letter b of a
new alphabet $A_1$, which obviously contains the letters $a_1, a_2, \ldots, a_{n-2}$ and b
(i.e., either $a_{n-1}$ or $a_n$), whose probabilities of occurrence in the message are
$p_1, p_2, \ldots, p_{n-2}$ and $p_{n-1} + p_n$, respectively. The alphabet $A_1$ is called the
alphabet obtained from A by contraction (or one-fold contraction).

The term ‘one-fold’ carries here the following sense. We arrange the letters
of the new alphabet $A_1$ in order of decreasing probability and carry out the
contraction of alphabet $A_1$. We then arrive at an alphabet $A_2$ of which it is natural
to say that it is obtained from the original alphabet A by two-fold contraction
(and from $A_1$ by a simple or one-stage contraction). It is clear that $A_2$ contains
in all $n - 2$ letters. The continuation of this process leads us to increasingly
shorter alphabets so that after $(n - 2)$-fold contraction we arrive at an alphabet
$A_{n-2}$ containing two letters in all. By way of an example, our earlier mentioned
alphabet containing 6 letters with probabilities 0.4, 0.2, 0.2, 0.1, 0.05 and 0.05
is transformed by successive contractions as shown in the accompanying table.


TABLE

No. of    Probabilities
letter    Original       Contracted alphabets
          alphabet A     A_1      A_2      A_3      A_4
1         0.4            0.4      0.4      0.4      0.6
2         0.2            0.2      0.2      0.4      0.4
3         0.2            0.2      0.2      0.2
4         0.1            0.1      0.2
5         0.05           0.1
6         0.05

(In each column the last two probabilities are added to give one letter of the
next column.)

We now agree to assign the code words 1 and 0 to the two letters of the last
alphabet $A_{n-2}$. Furthermore, if code words are assigned to all letters of alphabet
$A_j$, then to the letters of the ‘preceding’ alphabet $A_{j-1}$ (where, obviously,
$A_{1-1} = A_0$ is the original alphabet A) which are also letters of the alphabet $A_j$
we assign the same code words as they had in the alphabet $A_j$. However, to
the letters a′ and a″ of alphabet $A_{j-1}$ ‘coalesced’ into a single letter b of alphabet
$A_j$ we assign the words obtained from the code word of letter b with the
addition of digits 1 and 0 at the end; see the following table:


TABLE

No. of    Probabilities and code words
letter    Original
          alphabet A      A_1            A_2          A_3         A_4
1         0.4   0         0.4  0         0.4  0       0.4  0      0.6  1
2         0.2   10        0.2  10        0.2  10      0.4  11     0.4  0
3         0.2   111       0.2  111       0.2  111     0.2  10
4         0.1   1101      0.1  1101      0.2  110
5         0.05  11001     0.1  1100
6         0.05  11000

(In each column the last two letters coalesce into one letter of the next column;
their code words are obtained from the code word of that letter by adding 1 and 0.)


It is easy to see that the very construction of the Huffman code thus obtained
implies that it satisfies the general condition enumerated on pp. 140-141: no
code word is here the prefix of another, longer code word. We also note that
the coding of a certain alphabet by the Huffman method (as well as by the
Shannon-Fano method) is not a uniquely defined procedure. Thus, for
example, at any stage of the construction of the code we can obviously replace
the digit 1 by 0 and vice versa; then, we obtain two different codes (which
obviously differ quite insignificantly from each other and have the same length
for all code words). But, apart from this, in certain cases we can construct also
some Huffman codes that are substantially different; thus, for instance, in the
example analyzed above a code can also be constructed according to the
accompanying table.


TABLE

No. of    Probabilities and code words
letter    Original
          alphabet A      A_1            A_2          A_3         A_4
1         0.4   11        0.4  11        0.4  11      0.4  0      0.6  1
2         0.2   01        0.2  01        0.2  10      0.4  11     0.4  0
3         0.2   00        0.2  00        0.2  01      0.2  10
4         0.1   100       0.1  101       0.2  00
5         0.05  1011      0.1  100
6         0.05  1010

(Here each letter formed by coalescing two letters is placed above the other
letters of the same probability in the next column.)


The new code obtained here is also a Huffman code; but the code-word lengths
are now entirely different. However, note that the average number of elementary
signals per letter for both the Huffman codes constructed is precisely identical,
being

$$1 \times 0.4 + 2 \times 0.2 + 3 \times 0.2 + 4 \times 0.1 + 5 \times (0.05 + 0.05) = 2.3$$

in the first case, and

$$2 \times (0.4 + 0.2 + 0.2) + 3 \times 0.1 + 4 \times (0.05 + 0.05) = 2.3$$

in the second case.

Furthermore, it is clear that both the Huffman codes considered are highly
effective (the average code-word length here is the same as that obtained above
in the Shannon-Fano method). It can also be shown that the Huffman code is
the most effective of all possible codes in the sense that in any other method of
coding the letters of an alphabet the average number of elementary signals per
letter cannot be less than that obtained in the Huffman coding method. (Let us note
that this directly implies also that in any two Huffman codes the average
code-word length must be precisely the same; indeed, both happen to be optimal.)
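
The contraction procedure lends itself to a compact implementation with a priority queue. The sketch below is one possible rendering, not the construction of the tables themselves: the function name `huffman`, the use of `heapq`, the integer weights in hundredths, and the rule for breaking ties between equal probabilities are all assumptions made for illustration. Digits are prepended while merging, which is equivalent to appending them while passing back from the contracted alphabets to the original one.

```python
import heapq
from itertools import count


def huffman(weights):
    """Return one Huffman code (a list of code words) for the given weights."""
    tie = count()   # unique tie-breaker, so the letter lists are never compared
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    codes = [""] * len(weights)
    while len(heap) > 1:
        w0, _, group0 = heapq.heappop(heap)   # least probable 'letter'
        w1, _, group1 = heapq.heappop(heap)   # next least probable 'letter'
        for i in group0:
            codes[i] = "0" + codes[i]
        for i in group1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (w0 + w1, next(tie), group0 + group1))
    return codes


six = [40, 20, 20, 10, 5, 5]                  # probabilities in hundredths
codes = huffman(six)
average = sum(w * len(c) for w, c in zip(six, codes)) / sum(six)
print(codes)
print(average)                                # 2.3
```

With this particular tie-breaking the code-word lengths come out as 2, 2, 2, 3, 4, 4, the same lengths as in the second of the two tables; a different rule for ties can give the lengths 1, 2, 3, 4, 5, 5 of the first table, with the same average.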


The proof of this optimality property of Huffman codes is quite simple. We consider again
any n-letter alphabet (we denote it by B, say) containing the letters $b_1, b_2, \ldots, b_{n-1}, b_n$ with
probabilities $q_1, q_2, \ldots, q_{n-1}, q_n$, where

$$q_1 \ge q_2 \ge \cdots \ge q_{n-1} \ge q_n \qquad (*)$$

and obtain from it by contraction an $(n - 1)$-letter alphabet (alphabet $B_1$) containing the
letters $b_1, b_2, \ldots, b_{n-2}, c$, whose probabilities of occurrence are, respectively,
$q_1, q_2, \ldots, q_{n-2}$, $q_{n-1} + q_n = q$. Assume now that we have some system of code words for
the letters of alphabet $B_1$. Then we carry over this code word system also to alphabet B by
retaining the words of all letters that appear simultaneously in both alphabets and forming code
words for letters $b_{n-1}$ and $b_n$ by adding 1 and 0, respectively, to the end of the code word of
letter c. We now must show that if the code for the alphabet $B_1$ is optimal, then the code
obtained for the alphabet B in this manner is also optimal.

To prove this statement we suppose that the code obtained for B is not optimal
and show that in such a case the original code for $B_1$ also cannot be optimal. In fact, we
denote by $L_1$ and L the average code-word length of letters (i.e., the average number of
elementary signals per letter) for the codes corresponding to $B_1$ and B, respectively. It is
obvious that

$$L = L_1 + q. \qquad (**)$$

Indeed, $B_1$ and B differ only in that the letter c of $B_1$ with probability q is replaced in B by two
letters $b_{n-1}$ and $b_n$ with the same total probability of occurrence q (= $q_{n-1} + q_n$); however,
the code-word lengths corresponding to these alphabets differ only by an increase by one unit of
the lengths corresponding to the letters $b_{n-1}$ and $b_n$ in comparison to the length corresponding
to the letter c of $B_1$. Hence, the relation (**) follows immediately from the definition of
the average code-word length.
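
Relation (**) also gives a quick way of computing the average length of a Huffman code without writing out any code words: the two-letter alphabet has average length 1, and each step back towards the original alphabet adds the probability q of the letter being split. The following sketch (the function name and the use of `heapq` are assumptions for illustration) sums these probabilities over all contractions.

```python
import heapq


def huffman_average_length(weights):
    """Average code-word length found by applying L = L_1 + q at every contraction."""
    heap = list(weights)
    heapq.heapify(heap)
    total = sum(weights)
    merged_sum = 0
    while len(heap) > 1:
        q = heapq.heappop(heap) + heapq.heappop(heap)   # two least probable letters
        merged_sum += q                                 # contributes q to the average
        heapq.heappush(heap, q)
    return merged_sum / total


# Six-letter example, probabilities in hundredths: q = 0.1, 0.2, 0.4, 0.6 and 1.
print(huffman_average_length([40, 20, 20, 10, 5, 5]))   # 2.3
```

The final merge, whose weight is the whole of the probability, accounts for the average length 1 of the two-letter alphabet.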

It has been assumed that the code corresponding to alphabet B is not optimal. In other
words, there exists an optimal code other than the one under consideration which associates
with the letters $b_1, b_2, \ldots, b_{n-1}, b_n$ code words of length (in elementary signals)
$k_1, k_2, \ldots, k_{n-1}, k_n$ such that in it the average code-word length

$$L' = k_1 q_1 + k_2 q_2 + \cdots + k_{n-1} q_{n-1} + k_n q_n$$

is less than L. We can also consider that

$$k_1 \le k_2 \le \cdots \le k_{n-1} \le k_n. \qquad (***)$$

In fact, if the letters $b_i$ and $b_j$ (where i and j are any two of the numbers 1, 2, ..., n) are such
that $q_i > q_j$ (which, because of (*), implies the inequality $i < j$) and $k_i > k_j$, then we simply
interchange the code words of $b_i$ and $b_j$, after which the average code-word length of a letter
is further decreased; hence if $q_i > q_j$, then necessarily $k_i \le k_j$. Now, within a group of letters
$b_u, b_{u+1}, \ldots, b_v$ (where $1 \le u < v \le n$) such that $q_u = q_{u+1} = \cdots = q_v$, we can always
arrange the letters in such an order that $k_u \le k_{u+1} \le \cdots \le k_v$.

From inequalities (***) it follows, in particular, that a code word having the greatest length
$k_n$ corresponds to the letter $b_n$. Furthermore, we can be convinced of the existence of a
letter $b_l$ of the alphabet B whose code word is obtained from the code word of $b_n$ by replacing
the last elementary signal (either 1 by 0, or 0 by 1). In fact, if such a code word were altogether
absent, then we could simply discard the last elementary signal in the code word of $b_n$ without
violating the basic condition given atop p. 141 that defines an instantaneous code (recall that
we have no letter whose code word is longer than that of $b_n$). But this would again decrease the
average length of the code word of a letter, which contradicts the assumption of the optimality
of the code under consideration.

However, from inequalities (***) and the equality k, = k,, it follows that inevitably k, = ky_1 
(but this does not necessarily imply that 7 = n — 1). We now interchange the code words of 
b, and 6,., if/ 4 2 — 1 (if! =” — 1, then this step in the reasoning becomes superfluous); 
here the quantity L’ obviously remains unaffected. We now pass from the code for B to the 
code for alphabet B, by retaining the code words of all letters b,, b.,..., b,-,, and assigning 
to the letter c the code word obtained from the code words of letters b,_, and b, with the last 
digit removed (by which alone these two code words differ). It is obvious that the average 
code-word length L, of a code for the alphabet B, obtained in this manner is related to the 
average word-length L’ of a code for B by the following relation similar to (**): 


$$L' = L_1' + q.$$


Hence, the inequality $L' < L$ implies that


$$L_1' < L_1.$$


But this also shows that the original code for $B_1$ is not optimal.

We have as a matter of fact already completed the proof of the optimality of the Huffman
code. It is indeed clear that the code taken by us for the last alphabet $A_{n-2}$, which assigns to
the two letters of this alphabet the code words 1 and 0, is optimal: the average code-word
length 1 of a letter corresponding to it can in no way be decreased. But this implies by what
has been proved that the code for alphabet $A_{n-3}$ is also optimal, whence, in turn, follows the
optimality of the code for $A_{n-4}$, and so on till the last code (the Huffman code) corresponding
to the original alphabet $A_{n-n} = A_0$, i.e., alphabet A.
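
The successive merging of the two least probable letters, on which the proof above rests, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the five-letter distribution and the choice of which merged group receives the digit 0 are our assumptions, not part of the text.

```python
import heapq
import math
from itertools import count

def huffman_code(probs):
    """Return {letter: code word}, built by repeatedly merging the two
    least probable letters (or groups of letters) into one."""
    tie = count()  # unique tie-breaker, so the heap never compares dicts
    heap = [(p, next(tie), {letter: ""}) for letter, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        # Prepend one more digit to every code word of the two merged groups.
        merged = {a: "0" + w for a, w in group0.items()}
        merged.update({a: "1" + w for a, w in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

def average_length(probs, code):
    """Average code-word length: sum of probability times word length."""
    return sum(p * len(code[a]) for a, p in probs.items())

# Illustrative (hypothetical) alphabet and probabilities.
probs = {"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1}
code = huffman_code(probs)
L = average_length(probs, code)
H = -sum(p * math.log2(p) for p in probs.values())
print(code, L, H)
```

Different tie-breaking rules may produce different code words, but, by the optimality just proved, the average length is the same for all of them.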


The degree of proximity between the average number of binary symbols per 
letter of a message and the value H attained in the examples considered above 
can be further increased arbitrarily by taking recourse to the coding of increas- 
ingly lengthier blocks. This flows from the following general statement which 
we shall hereafter call the fundamental coding theorem†: in coding a message
segmented into N-letter blocks, it is possible by choosing N sufficiently large to 
assure that the average number of binary elementary signals per letter of the 


†To be more exact, it should be designated as the fundamental coding theorem for noiseless
channels. The extension of this result to the problem of the most advantageous coding,
taking account of the impact of noise, is considered in Sec. 4.4. 
original message is arbitrarily close to H (in other words, arbitrarily close to the
ratio of the amount of information H contained in a letter of the message to 1
bit, i.e., to the greatest amount of information that can be contained in one
elementary signal). Put differently: a sufficiently long
message of M letters can be encoded by means of a number of elementary signals
arbitrarily close to (but obviously in no case less than) MH, if only this message
is divided beforehand into sufficiently long blocks of N letters and separate code 
words are straightaway associated with all blocks. We further note that it is not 
by accident that we have not stated anything here as to precisely how we should 
construct N-letter blocks: as seen in the following, the methods for block coding 
may be highly diverse (thus, for example, it is possible to follow either the Huff- 
man or the Shannon-Fano coding method, but these are by no means the only 
possibilities open to us). Thus, the partitioning of a message into quite lengthy 
blocks plays a central role in the construction of an optimal code. It will be seen 
in Section 4.4 that direct block coding is of considerable advantage in the case 
of noisy channels, too (though the coding method itself has to be substantially 
modified in that case). 

In view of the crucial importance of the fundamental coding theorem, we shall 
now give two completely different versions of its proof (both due to C. E. Shan- 
non). The first essentially rests on the use of the Shannon-Fano coding method 
though, as we shall see later, a direct appeal to this method is not made in the 
proof. It is presumed for the present that under the successive divisions of the 
collection of letters to be coded (which can also be understood as entire ‘blocks’) 
into smaller groups, which forms the basis of the Shannon-Fano coding, we 
succeed each time in attaining the result that the total probabilities of both the 
groups obtained are precisely equal to each other. In such a case, the first,
second, ..., $l$th divisions yield groups whose probabilities sum to 1/2, 1/4, ...,
$1/2^l$, respectively. An $l$-digit code word is assigned here to those letters which were
singled out into a one-element group after exactly $l$ divisions, i.e., to the
letters whose probability is $1/2^l$. In other words, subject to the fulfilment of this
condition the code-word length $l_i$ is related to the probability $p_i$ of the corresponding
letter by the formula


$$p_i = \frac{1}{2^{l_i}}, \qquad l_i = \log \frac{1}{p_i} = -\log p_i.$$


In fact, our condition can be precisely satisfied only in certain exceptional cases. 
The preceding formula directly implies that here the probability $p_i$ of every letter
of the alphabet must be unity divided by an integral power of the number 2.
But in the general case the quantity $-\log p_i$, where $p_i$ is the probability of the ith
letter of the alphabet, is, as a rule, not an integer. Hence, the code-word length
$l_i$ of the ith letter cannot be equal to $-\log p_i$. However, since in the Shannon-Fano
coding method we successively divide our alphabet into groups of closest
possible total probability, the code-word length of the ith letter in such coding
shall be close to $-\log p_i$. We denote by $l_i$ in this connection the smallest integer
not less than $-\log p_i$, i.e., such that


$$-\log p_i \le l_i < -\log p_i + 1. \tag{A}$$

Inequality (A) can be rewritten as

$$-l_i \le \log p_i < -(l_i - 1),$$

or

$$\frac{1}{2^{l_i}} \le p_i < \frac{1}{2^{l_i - 1}}. \tag{B}$$


Let us now show that there exists a coding method in which the code-word length
of the ith letter exactly equals this number $l_i$. It is just this fact (and not the
description of the corresponding coding method)† that is needed by us in the
proof of the fundamental theorem.
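
To make the definition of the numbers $l_i$ concrete, the following short Python sketch computes them for a small distribution and checks inequalities (A) and (B) directly. The function name and the probabilities are hypothetical and serve only as an illustration.

```python
import math

def shannon_lengths(probs):
    """Smallest integers l_i that are not less than -log2(p_i)."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]        # illustrative distribution
lengths = shannon_lengths(probs)

for p, l in zip(probs, lengths):
    assert -math.log2(p) <= l < -math.log2(p) + 1   # inequality (A)
    assert 2.0 ** -l <= p < 2.0 ** -(l - 1)         # inequality (B)

print(lengths)
```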

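We first show that in the case of any n numbers $l_1, l_2, \ldots, l_n$ satisfying the
inequality


$$\frac{1}{2^{l_1}} + \frac{1}{2^{l_2}} + \cdots + \frac{1}{2^{l_n}} \le 1 \tag{1}$$


there exists a binary code for which these numbers are the lengths of code words
corresponding to n letters of some alphabet. In fact, let $n_1, n_2, \ldots, n_k$ be the numbers
of those of $l_1, l_2, \ldots, l_n$ which are, respectively, equal to 1, 2, ..., k
(where $n_1 + n_2 + \cdots + n_k = n$, so that k is the maximum value of the numbers
$l_1, l_2, \ldots, l_n$). In this case, inequality (1) can be written in the form


$$\frac{n_1}{2} + \frac{n_2}{4} + \frac{n_3}{8} + \cdots + \frac{n_k}{2^k} \le 1.$$


Hence it immediately follows that


$$\frac{n_1}{2} \le 1, \quad \text{or} \quad n_1 \le 2;$$

$$\frac{n_2}{4} \le 1 - \frac{n_1}{2}, \quad \text{or} \quad n_2 \le 2(2 - n_1);$$

$$\frac{n_3}{8} \le 1 - \frac{n_1}{2} - \frac{n_2}{4}, \quad \text{or} \quad n_3 \le 2[4 - (2n_1 + n_2)];$$

$$\cdots$$

$$\frac{n_k}{2^k} \le 1 - \frac{n_1}{2} - \frac{n_2}{4} - \cdots - \frac{n_{k-1}}{2^{k-1}}, \quad \text{or}$$

$$n_k \le 2[2^{k-1} - (2^{k-2} n_1 + 2^{k-3} n_2 + \cdots + n_{k-1})]$$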

†Regarding this description, see the text in small print on p. 173 et seq.
(see p. 135). It is, however, clear that the condition $n_1 \le 2$ guarantees the
possibility of the choice of $n_1$ distinct code words of length 1. In analogy to
this, the inequality $n_2 \le 2(2 - n_1)$ indicates the possibility of choosing additionally
$n_2$ code words of length 2 starting with a binary digit other than those
which are already 'taken up' by the code words of length 1; as a matter of fact,
the number of such 'free' first digits equals $2 - n_1$, and to each of them we can
add at the end either the digit 0 or 1. Exactly in the same way, the inequality
$n_3 \le 2[4 - (2n_1 + n_2)]$ allows us to choose additionally $n_3$ code words of length
3, whose first digit is other than the $n_1$ digits 'taken up' by the code words of
length 1 and whose first two digits differ from the $n_2$ two-digit numbers 'taken up'
by the code words of length 2. (In fact, $2n_1 + n_2$ is the number of two-digit binary
numbers which either start with one of the $n_1$ digits serving as a code word of
length 1, or coincide with one of the $n_2$ code words of length 2, and 4 is the
number of all possible two-digit binary numbers with which, in principle, we can
start a code word of length 3.) Continuing this reasoning, it is easily seen that
the inequality


$$n_k \le 2[2^{k-1} - (2^{k-2} n_1 + 2^{k-3} n_2 + \cdots + n_{k-1})]$$


allows us to choose $n_k$ code words of length k, whose first digit, first two digits,
first three digits, ..., coincide with none of the $n_1, n_2, n_3, \ldots$ code words of
length 1, 2, 3, ..., respectively. In fact, $2^{k-1}$ is the number of all possible
initial combinations of $k - 1$ binary digits and $2^{k-2} n_1 + 2^{k-3} n_2 + \cdots + n_{k-1}$ is
the number of such combinations that are already 'taken up' (see p. 135). This
leads precisely to the conclusion that the fulfilment of inequality (1) assures the
possibility of choosing n code words of lengths $l_1, l_2, \ldots, l_n$ satisfying the condition
enumerated atop p. 141 in italicized print; these are precisely the code
words we can associate with the letters of an n-letter alphabet.
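
The choice of 'free' code words described above can be sketched as follows. The sketch processes the lengths in increasing order and always takes the next combination of digits not yet 'taken up'; this particular bookkeeping (a running binary counter) is our own assumption about how to organize the selection, and the function names are hypothetical.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Left-hand side of inequality (1), computed exactly."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Choose binary code words with the given lengths so that no word
    is the beginning of another; requires inequality (1) to hold."""
    if kraft_sum(lengths) > 1:
        raise ValueError("inequality (1) is violated")
    words = [None] * len(lengths)
    next_value, prev_len = 0, 0
    for i in sorted(range(len(lengths)), key=lambda j: lengths[j]):
        l = lengths[i]
        next_value <<= l - prev_len          # lengthen the first free combination
        words[i] = format(next_value, "0{}b".format(l))
        next_value += 1                      # this combination is now 'taken up'
        prev_len = l
    return words

words = prefix_code_from_lengths([2, 2, 3, 4])
print(words)
```

Exact fractions are used for the sum so that a borderline case such as 1/2 + 1/4 + 1/4 = 1 is not misjudged because of rounding.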

For completing the existence proof for the required codes, it remains to note 
only that, by inequality (B) defining the code-word length $l_i$, we have $1/2^{l_i} \le p_i$
for all i = 1, 2, ..., n, where $p_i$ is the probability of the ith letter. Thus,


$$\frac{1}{2^{l_1}} + \frac{1}{2^{l_2}} + \cdots + \frac{1}{2^{l_n}} \le p_1 + p_2 + \cdots + p_n = 1.$$


Hence the numbers $l_1, l_2, \ldots, l_n$ indeed satisfy inequality (1), which is a prerequisite
for them to be the code-word lengths of a binary code.

The proof of the fundamental coding theorem can now be completed quite
easily. In fact, the average number $l$ of binary signals per letter of the original
message (in other words, the average code-word length) is, by definition, given
by the sum


$$l = p_1 l_1 + p_2 l_2 + \cdots + p_n l_n.$$
We now multiply by $p_i$ inequality (A), defining the quantity $l_i$, sum up all the
inequalities so obtained corresponding to the values i = 1, 2, ..., n, and note
that


$$H = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n,$$


where $H = H(\alpha)$ is the entropy of the experiment $\alpha$ consisting of determining
one letter of the message, and that $p_1 + p_2 + \cdots + p_n = 1$. Consequently,


$$H \le l < H + 1.$$


We now apply this inequality to the case in which the method set forth above
is used for coding all possible N-letter blocks (which can be considered as 'letters'
of a new alphabet). By virtue of the assumption that successive letters of the
message are independent, the entropy of the compound experiment $\alpha_1 \alpha_2 \ldots \alpha_N$ consisting in
the determination of all letters of a block is given by


$$H(\alpha_1 \alpha_2 \ldots \alpha_N) = H(\alpha_1) + H(\alpha_2) + \cdots + H(\alpha_N) = N H(\alpha) = NH.$$


Consequently, the average code-word length $l_N$ of N-letter blocks satisfies the
inequality


$$NH \le l_N < NH + 1.$$


But in coding N-letter blocks the average number $l$ of binary elementary signals
per letter of message is equal to the average code-word length $l_N$ of one block
divided by the number N of letters in the block:


$$l = \frac{l_N}{N}.$$


Hence in such coding


$$H \le l < H + \frac{1}{N},$$


i.e., the average number of elementary signals per letter differs here from the
minimum value of H by not more than 1/N. Letting $N \to \infty$, we immediately
arrive at the fundamental coding theorem.
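
The bound just obtained is easy to observe numerically. The sketch below, written under our own illustrative assumptions (a two-letter alphabet with probabilities 0.7 and 0.3, hypothetical function names), assigns to every N-letter block a code word whose length is the smallest integer not less than minus the logarithm of the block probability, and prints the resulting number of binary signals per letter for several N.

```python
import math
from itertools import product

def entropy(probs):
    """Entropy of one letter, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def per_letter_length(probs, N):
    """Average number of binary signals per letter when each N-letter block
    of independent letters gets a code word of length ceil(-log2 P(block))."""
    total = 0.0
    for block in product(probs, repeat=N):
        p_block = math.prod(block)          # probability of the whole block
        total += p_block * math.ceil(-math.log2(p_block))
    return total / N

probs = [0.7, 0.3]                          # illustrative probabilities
H = entropy(probs)
for N in (1, 2, 4, 8):
    print(N, per_letter_length(probs, N), "must lie in", (H, H + 1 / N))
```

Only the code-word lengths are computed here; that code words with these lengths actually exist is guaranteed by inequality (1).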


Before we proceed further, we note that the proof deduced here can be 
applied also to the more general case in which the successive letters of a text are 


mutually dependent. For this we must rewrite the inequality for the quantity $l_N$
in the form


$$H^{(N)} \le l_N < H^{(N)} + 1,$$

where


$$H^{(N)} = H(\alpha_1 \alpha_2 \ldots \alpha_N) = H(\alpha_1) + H_{\alpha_1}(\alpha_2) + H_{\alpha_1 \alpha_2}(\alpha_3) + \cdots + H_{\alpha_1 \alpha_2 \ldots \alpha_{N-1}}(\alpha_N)$$


is the entropy of an N-letter block which, in the case of the letters of a message
being dependent upon each other, is always less than NH (because $H(\alpha_1) = H$
and $H(\alpha_1) \ge H_{\alpha_1}(\alpha_2) \ge H_{\alpha_1 \alpha_2}(\alpha_3) \ge \cdots \ge H_{\alpha_1 \alpha_2 \ldots \alpha_{N-1}}(\alpha_N)$). This implies
that


$$\frac{H^{(N)}}{N} \le l < \frac{H^{(N)}}{N} + \frac{1}{N},$$


where $l$ is the average number of elementary signals per letter of message. Hence,
in this general dependent case, as $N \to \infty$ (as the block length increases indefinitely)
the average number of elementary signals required for the transmission of one
letter can be brought arbitrarily close to the quantity $H_\infty$, where


$$H_\infty = \lim_{N \to \infty} \frac{H^{(N)}}{N}$$


is the 'specific entropy' per letter of a multiletter text (we shall discuss the quantity
$H_\infty$ more elaborately later in the next section).†
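
As a small illustration of how $H^{(N)}/N$ decreases towards $H_\infty$, consider the following Python sketch. It rests on an assumption that is ours and not the text's: the letters form a stationary chain in which each letter depends only on the one immediately preceding it, so that every conditional entropy after the first equals $H_{\alpha_1}(\alpha_2)$. The transition probabilities are hypothetical.

```python
import math

def h(dist):
    """Entropy in bits of one probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical two-letter source: P[i][j] is the probability that
# letter j follows letter i.
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary letter probabilities for a two-letter chain.
pi = [P[1][0] / (P[0][1] + P[1][0]), P[0][1] / (P[0][1] + P[1][0])]

H1 = h(pi)                                        # H(alpha_1)
H_cond = sum(pi[i] * h(P[i]) for i in range(2))   # H_{alpha_1}(alpha_2)

def block_entropy(N):
    """H^(N) under the assumption above: the first letter contributes H1,
    every later letter contributes H_cond."""
    return H1 + (N - 1) * H_cond

rates = [block_entropy(N) / N for N in range(1, 51)]
print(rates[0], rates[9], rates[49], "limit:", H_cond)
```

For such a source the limit $H_\infty$ coincides with the conditional entropy `H_cond`, and the printed values approach it from above.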


We now give the second proof of our fundamental coding theorem; the succes- 
sive letters of message are again considered here to be mutually independent. 
This proof is lengthier than its predecessor, but then it is more instructive since 
it makes transparent the meaning of the concept of entropy itself (see pp. 55-56). 
In addition, this new proof shows us that, even in the case of sharply differing 
probabilities of different letters, when coding very long blocks we can always 
make use of ‘almost uniform’ codes by associating with all blocks code words 
of the same length, except for a certain part of them having a negligibly small 
probability sum. As regards the latter ‘low-probability’ blocks, it is easy to 
understand that they can be coded on an ‘as and when occurring’ basis: since 
the probability of the occurrence of any such block is quite small, the method 
of coding these blocks is of no significant importance. 

For greater clarity we start our proof with a detailed examination of the simplest
case in which the entire 'alphabet' consists in all of two letters a and b with
probabilities $p_a = p$ and $p_b = 1 - p = q$. We shall code all possible sequences
('blocks') consisting of N successive letters a and b. The total number of such
distinct N-term sequences is $2^N$ (see pp. 55-56). However, a majority of these


†The existence of the limit $H_\infty$ directly follows from the inequality $H(\alpha_1) \ge H_{\alpha_1}(\alpha_2) \ge
H_{\alpha_1 \alpha_2}(\alpha_3) \ge \cdots$, which shows that $H(\alpha) = H^{(1)}, H^{(2)}/2, H^{(3)}/3, \ldots, H^{(N)}/N, \ldots$ is a
monotonically nonincreasing sequence of positive (i.e., greater than zero) numbers.
N-term sequences have negligible probability. Since the relative frequency of
the occurrence of letters a and b is p and q respectively, for a sufficiently large
N only the aggregate of those sequences will have a significant probability in
which, of the total number N of letters, the letter a occurs roughly Np times and
the letter b occurs the remaining roughly $N - Np = Nq$ times. To be more exact,
it can be stated that when N is quite large all sequences in which the relative
frequency of occurrence of a is not confined to the range from $p - \varepsilon$ to $p + \varepsilon$,
where $\varepsilon$ is an arbitrarily chosen very small number (say 0.001, or 0.0001, or
0.000001; for $\varepsilon$ can take any of these or even any still smaller value, if only N
is sufficiently large), have an extremely small probability sum so that in general
they can be ignored in calculation. As to the sequences in which a occurs
from $N(p - \varepsilon)$ to $N(p + \varepsilon)$ times, obviously each such sequence also has a
small individual probability (for large N the total number of possible sequences
is very large, but the probability of each of them individually is quite small), yet
the probability sum of all these sequences is quite close to 1.
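
The last assertion can be checked by direct summation. The following Python sketch adds up the probabilities of all sequences in which the letter a occurs from $N(p - \varepsilon)$ to $N(p + \varepsilon)$ times; the values p = 0.3 and $\varepsilon$ = 0.05 and the function names are illustrative assumptions. Logarithms are used so that the huge number of sequences and their tiny probabilities do not overflow.

```python
import math

def log_prob_exactly(N, K, p):
    """Natural logarithm of the total probability of all N-letter sequences
    containing the letter a exactly K times."""
    return (math.lgamma(N + 1) - math.lgamma(K + 1) - math.lgamma(N - K + 1)
            + K * math.log(p) + (N - K) * math.log(1 - p))

def probable_mass(N, p, eps):
    """Probability sum of sequences whose frequency of a lies in [p-eps, p+eps]."""
    lo = max(math.ceil(N * (p - eps)), 0)
    hi = min(math.floor(N * (p + eps)), N)
    return sum(math.exp(log_prob_exactly(N, K, p)) for K in range(lo, hi + 1))

masses = {N: probable_mass(N, p=0.3, eps=0.05) for N in (100, 1000, 5000)}
for N, m in masses.items():
    print(N, m)
```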

Let us now note that the number of N-letter sequences in which a is encountered
exactly Np times† is equal to the number $\binom{N}{Np}$ of combinations of N
elements taken Np at a time (i.e., the number of Np-element subsets of a given
set of N elements). This makes it necessary to estimate the quantity $\binom{N}{K}$ (see
footnote † below) with its dependence on N and K.

In order to make clearer the idea underlying our reasonings, we present
first the derivation (not needed later by us) of the formula for the number $\binom{N}{K}$.


Suppose that we have N (paper) contours and N different colours, with which we 
desire to colour these contours—each in its own colour. Since we can paint first 
contour in any of the N available colours, the second in any of the remaining 
N — 1 colours, the third in any of the N — 2 colours not already used, finally 
the last contour in the only colour left at our disposal, the total number of 
possible contour colourings is 


$$N(N - 1)(N - 2)(N - 3) \cdots 1 = N!.$$


Now let us call any K colours to be the ‘first’ and the remaining N — K colours 
the ‘second’; furthermore, we choose any K contours, which we consider as the 
‘first’ (and the other N — K contours as the ‘second’). In such case we have K! 
ways of painting K 'first' contours in the K 'first' colours and (N − K)! ways
for colouring the remaining N − K contours in the N − K 'second' colours.


†If Np is not an integer, then we replace this number by the integer K that is closest to Np:
when N is large the difference between Np and K is negligibly small. A similar observation
can also be made in relation to the number $N\varepsilon$.
By combining any of the K! ways of painting the K 'first' contours with any of
the (N − K)! ways of colouring the remaining contours, we get altogether


$$K! \times (N - K)!$$


ways of colouring N contours in which the chosen K 'first' contours are coloured
in K 'first' colours. In addition, since the K 'first' contours can be chosen from
the total number N of contours in $\binom{N}{K}$ ways, the total number of distinct colourings
must be


$$\binom{N}{K} K! \, (N - K)!.$$


Consequently,

$$N! = \binom{N}{K} K! \, (N - K)!,$$


implying also the desired equation


$$\binom{N}{K} = \frac{N!}{K! \, (N - K)!}. \tag{*}$$


The well-known equation (*) gives an exact expression for the number $\binom{N}{K}$
in terms of the numbers N and K; however, for large N (and only the case of
large N will be of interest to us in the following) it becomes inconvenient. The
fact is that N! is the product of N distinct factors; an evaluation of its value for
large N is rather complicated. Hence, in what follows we shall use not this
equation, but an approximate estimate of the value of $\binom{N}{K}$. This estimate
differs from the right-hand side of (*) mainly in this that it includes only the
powers of N, K and N − K, which are easy to evaluate by taking logarithms.
The desired estimate of $\binom{N}{K}$ will be derived below.

Let us consider the same problem of colouring N contours in N colours, which 
we used for deriving the formula (*), but we do not require now that each con- 
tour be necessarily coloured in its own colour. In this case, the first contour as 
before can be coloured in any N colours; however, the second, third, ..., and 
last contour can also be coloured in any N colours. Hence, the total number 
of colourings in this case is now given by the expression 


$$\underbrace{N \times N \times \cdots \times N}_{N \text{ factors}} = N^N.$$

If we now again choose any K 'first' colours and K 'first' contours, then these K
contours can be painted in K colours in $K^K$ ways. However, the remaining
N − K 'second' contours can be painted in $(N - K)^{N-K}$ ways with N − K 'second'
colours. By combining each of the possible $K^K$ paintings of 'first' contours
with each of the $(N - K)^{N-K}$ colourings of the remaining contours, we get
altogether


$$K^K \times (N - K)^{N-K}$$

different ways of painting all N contours. This number ought to be further
multiplied by $\binom{N}{K}$, since $\binom{N}{K}$ is the number of ways in which K 'first' contours
can be chosen from the total number of N contours. This yields the number


$$\binom{N}{K} K^K (N - K)^{N-K}$$


of different colourings. However, this number is found to be not equal to but
less than the total number $N^N$ of possible colourings of N contours. In fact,
$\binom{N}{K} K^K (N - K)^{N-K}$ is the number of those colourings in which K 'first' colours
are used exactly K times (but there exist also the colourings in which these K
colours are used N times, say, or are not used at all!). Thus, finally, we get


$$\binom{N}{K} K^K (N - K)^{N-K} < N^N.$$


This also yields the desired estimate of $\binom{N}{K}$:


$$\binom{N}{K} < \frac{N^N}{K^K (N - K)^{N-K}}. \tag{**}$$
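
Both the exact formula (*) and the estimate (**) are easy to confirm for particular numbers. The Python sketch below does so with exact integer arithmetic and also compares the binary logarithms of the two sides of (**); the chosen pairs (N, K) and the function names are illustrative assumptions.

```python
import math

def log2_binom(N, K):
    """Binary logarithm of the number of combinations of N elements taken K at a time."""
    return math.log2(math.comb(N, K))

def log2_bound(N, K):
    """Binary logarithm of N^N / (K^K (N-K)^(N-K)), the right-hand side of (**)."""
    return N * math.log2(N) - K * math.log2(K) - (N - K) * math.log2(N - K)

for N, K in [(10, 3), (100, 30), (1000, 300)]:
    # Equation (*), checked exactly.
    exact = math.factorial(N) // (math.factorial(K) * math.factorial(N - K))
    assert math.comb(N, K) == exact
    # Inequality (**), written without division and checked exactly.
    assert math.comb(N, K) * K ** K * (N - K) ** (N - K) < N ** N
    print(N, K, log2_binom(N, K), log2_bound(N, K))
```

The printed logarithms show that, although the two sides of (**) differ, the difference grows much more slowly than either of them.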


Let us now replace K by Np in (**); this converts N − K into $N - Np =
N(1 - p) = Nq$. Hence, we get the estimate


$$\binom{N}{Np} < \frac{N^N}{(Np)^{Np} (Nq)^{Nq}} = \frac{N^N}{N^{Np + Nq} \, p^{Np} q^{Nq}} = \frac{N^N}{N^N p^{Np} q^{Nq}} = \frac{1}{p^{Np} q^{Nq}}$$


for the number $\binom{N}{Np}$ of 'most probable' N-letter sequences of the letters a and b,
i.e., the sequences in which the letter a is encountered exactly Np times (and
the letter b the remaining $Nq = N - Np$ times). Roughly, there are as many
sequences in which a occurs, say, $Np + 1, Np + 2, \ldots, Np + N\varepsilon$ times, or
$Np - 1, Np - 2, \ldots, Np - N\varepsilon$ times as those where a occurs exactly Np times
(since in all these cases the deviation of the frequency of occurrence of a from p
is very small). Hence, without any risk of serious error, we can consider that
the total number of 'probable' sequences (i.e., the sequences such that all other
sequences taken together have very small probability, which can be neglected)
does not exceed the value


$$M_1 = 2N\varepsilon \times \frac{1}{p^{Np} q^{Nq}} = \frac{2\varepsilon N}{p^{Np} q^{Nq}},$$

where $\varepsilon$ is some small number.

We now use the best uniform code for coding $M_1$ (or less than $M_1$) probable
sequences.† Since the number of such sequences is quite large, the code-word
length practically coincides here with the binary logarithm of the number of
sequences (see p. 106). Hence this code-word length is not greater than


$$\log M_1 = \log 2\varepsilon + \log N - N(p \log p + q \log q).$$


Consequently, the average number of binary digits per letter of message does
not exceed here the value


$$\frac{\log M_1}{N} = H + \frac{\log N}{N} + \frac{\log 2\varepsilon}{N},$$

where

$$H = -p \log p - q \log q.$$


As $N \to \infty$ the second and third terms on the right-hand side of the penultimate
equation tend to zero (recall that the ratio $\log N / N = -(1/N) \log (1/N)$ tends
to zero as $N \to \infty$; see p. 47). This implies that, if we restrict ourselves to
'probable' sequences, then the average number of binary digits per letter of a
message can be made arbitrarily close to H.††
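
The convergence to H can be followed numerically. The sketch below evaluates, for growing N, both the exact quantity $(1/N) \log \binom{N}{Np}$ and the upper estimate $(\log M_1)/N$ derived above; the values p = 0.3 and $\varepsilon$ = 0.01 and the function names are assumptions made only for illustration.

```python
import math

def H2(p):
    """Entropy of a two-letter alphabet with probabilities p and 1 - p, in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def exact_rate(N, p):
    """(1/N) log2 of the number of sequences with the letter a exactly K times,
    where K is the integer closest to N p."""
    K = round(N * p)
    return math.log2(math.comb(N, K)) / N

def m1_rate(N, p, eps):
    """(log2 M_1) / N, i.e. H + log2(N)/N + log2(2 eps)/N."""
    return H2(p) + math.log2(N) / N + math.log2(2 * eps) / N

p, eps = 0.3, 0.01
for N in (100, 1000, 10000):
    print(N, exact_rate(N, p), m1_rate(N, p, eps), "H =", H2(p))
```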

As regards the remaining 'low-probability' sequences, even if in coding each letter
of these sequences we use a number of binary symbols several times greater
than H, the average number of such symbols needed
per letter of a message remains all the same almost unchanged (since the
probability sum of all such sequences is negligibly small). Hence in coding
the remaining sequences it is in fact necessary just to take care that none of
the corresponding code words coincides with the extension of any other code


†It is easy to see that the application of any nonuniform code to these 'probable sequences'
may not confer any substantial advantage. This is due to the fact that probabilities of all such
sequences differ only slightly from each other (since the relative frequencies of both the letters
are here practically the same in all cases).

††Of course, this number cannot be less than H (see p. 149).
word being used. This objective can be achieved, for instance, if right from the
beginning we add 1 to the total number of 'probable' sequences (the replacement
of $M_1$ by $M_1 + 1$ obviously does not change any of the above estimates). Then
we can make use of the fact that in such a case we certainly have at least one
'free' code word of the same length as all the code words of 'probable' sequences.
If we now prefix this 'free' code word to all code words of 'low-probability' sequences,
then it will be guaranteed that none of the new words is an extension of
one of the old words. After this word, we can add (say) the result of applying
to the 'low-probability' sequences any most efficient uniform code, after which
finally for all 'low-probability' sequences code words of one and the same length
will be obtained, satisfying the required condition.
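
The construction with the 'free' code word can be sketched in Python for a small block length. The concrete choices here are our assumptions: N = 10, p = 0.3, $\varepsilon$ = 0.15, a function name of our own, and, as the uniform code for the 'low-probability' sequences, simply the N digits obtained by writing 0 for a and 1 for b (which is uniform, though not the most economical possible).

```python
import math
from itertools import product

def almost_uniform_code(N, p, eps):
    """Code words for all N-letter sequences over {a, b}: 'probable' sequences
    get words of one common length; every other sequence gets the 'free' word
    of that length followed by N further digits."""
    seqs = ["".join(s) for s in product("ab", repeat=N)]
    probable = [s for s in seqs if abs(s.count("a") / N - p) <= eps]
    # Common length: enough digits for M_1 + 1 distinct words.
    width = math.ceil(math.log2(len(probable) + 1))
    code = {s: format(i, "0{}b".format(width)) for i, s in enumerate(probable)}
    free_word = format(len(probable), "0{}b".format(width))
    for s in seqs:
        if s not in code:
            code[s] = free_word + s.replace("a", "0").replace("b", "1")
    return code

code = almost_uniform_code(10, 0.3, 0.15)
print(sorted({len(w) for w in code.values()}))
```

Since the 'free' word differs from every word of the same length given to a 'probable' sequence, no code word is the beginning of another, which is exactly the condition required.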

The general case of an n-letter alphabet, in which individual letters have probabilities
$p_1, p_2, \ldots, p_n$ respectively, where $p_1 + p_2 + \cdots + p_n = 1$, is analyzed
almost in the same way. In the case of a long sequence of N letters, the greatest
probability will belong to sequences in which the first, second, ..., nth letter is
encountered nearly $Np_1, Np_2, \ldots, Np_n$ times. The number of sequences in which
the first, second, ..., nth letter occurs exactly $Np_1, Np_2, \ldots, Np_n$ times is
equal to the number of partitions of a set of N elements into n subsets containing
respectively $Np_1, Np_2, \ldots, Np_n$ elements.

Let us now consider the problem of colouring N contours with N colours such
that each colour is used only once. If we partition the colours into n groups
containing, respectively, $Np_1, Np_2, \ldots, Np_n$ colours, we can show, in complete
analogy with the derivation of equation (*), that the number of such partitions
of a set of N elements into n subsets is


$$\frac{N!}{(Np_1)! \, (Np_2)! \cdots (Np_n)!}.$$

This equation generalizes the ordinary equation for the number of combinations
$\binom{N}{K}$. If we consider further the problem of colouring N contours with N
colours (as before, partitioned into n groups, of which the first, second, ..., last
contain $Np_1, Np_2, \ldots, Np_n$ colours) in which it is not specified that each colour
is used only once, we can verify in a way similar to the derivation of inequality
(**) that the number of partitions of a set of N elements into n subsets we are
interested in is less than the number


$$\frac{1}{p_1^{Np_1} p_2^{Np_2} \cdots p_n^{Np_n}}.$$


†The derivation of this equation can also be found in [38].

Applying this result to the 'probable' sequences, in which the frequency of
occurrence of the first, second, ..., nth letters lies, respectively, between $p_1 - \varepsilon$
and $p_1 + \varepsilon$, $p_2 - \varepsilon$ and $p_2 + \varepsilon$, ..., $p_n - \varepsilon$ and $p_n + \varepsilon$, we find that the total
number of such sequences certainly does not exceed the number


$$(2\varepsilon N)^n \times \frac{1}{p_1^{Np_1} p_2^{Np_2} \cdots p_n^{Np_n}} = \frac{2^n \varepsilon^n N^n}{p_1^{Np_1} p_2^{Np_2} \cdots p_n^{Np_n}}.$$


As to the remaining sequences in which the frequency of occurrence of even one 
of the letters is not contained within the stated limits, the probability sum of all 
these sequences is negligibly small, which permits us to ignore them altogether. 

It is now just routine to show that by encoding all our 'probable' sequences
by means of a most efficient uniform code we arrive at code words whose length
is less than


$$NH + n \log N + n \log 2\varepsilon,$$

where

$$H = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n.$$


Consequently, the average number of binary symbols required to write one letter
does not exceed


$$H + n \frac{\log N}{N} + n \frac{\log 2\varepsilon}{N}.$$


As $N \to \infty$ this number obviously tends to H. Hence H is equal to the limit of the
average number of binary symbols required per letter of a message in such a coding
method. This also is just the result we sought to prove.
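
For a concrete alphabet the comparison between the exact number of partitions and the estimate above can be carried out as follows. The three-letter example with N = 100 and the function names are illustrative assumptions.

```python
import math

def partitions_count(counts):
    """N! / (N_1! N_2! ... N_n!) for N = N_1 + ... + N_n, computed exactly
    as a product of numbers of combinations."""
    result, remaining = 1, sum(counts)
    for c in counts:
        result *= math.comb(remaining, c)
        remaining -= c
    return result

def entropy(probs):
    """Entropy of one letter, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

counts = [50, 30, 20]                 # N p_1, N p_2, N p_3 for N = 100
N = sum(counts)
probs = [c / N for c in counts]

log2_count = math.log2(partitions_count(counts))
log2_estimate = N * entropy(probs)    # log2 of 1 / (p_1^{N p_1} ... p_n^{N p_n})
print(log2_count, log2_estimate)
```

The second printed number is N H, so the comparison shows directly that the number of sequences with exactly these letter counts is less than $2^{NH}$.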


Finally, it is worth while to reemphasize the main basis of the proof deduced. 
If we consider all sequences of N letters from an n-letter ‘alphabet’ (or equiva- 
lently, all sequences of N successive outcomes of a many times repeated experi- 
ment, which can have n different outcomes), then the total number of such dis- 
tinct sequences is 


$$n^N = 2^{N \log n}.$$


However, the probability of each such individual sequence and even of some 
appreciable collections of such sequences for large N is completely insignificant. 
It has been shown that, if we permit ourselves to exclude from consideration a 
part of the least probable sequences, but only such that the probability sum of 
all discarded sequences is sufficiently small (say, not exceeding a certain preassigned
extremely small number $\delta$), then for any (arbitrarily small!) $\delta$ in the
case of N being sufficiently large it is possible to obtain the result that the number
of remaining sequences has the order


$$2^{NH},$$


where H is the entropy.† Note also the fact that since H is less than $\log n$ (excepting
the case in which all letters or all outcomes are equally probable), the
number of our 'probable' sequences for extremely large N is incomparably
smaller than the total number of all sequences (the ratio


$$2^{NH} : 2^{N \log n} = 2^{-N(\log n - H)}$$


of the number of 'probable' sequences to the number of all sequences rapidly
tends to zero as $N \to \infty$). It has also been shown that for large N it is possible
to establish the fact that the relative frequencies of the occurrence of individual
letters in our 'probable' sequences differ as little as desired from the most probable
frequencies $p_1, p_2, \ldots, p_n$. Since the probability of a sequence depends
only on the numbers of individual letters occurring in it (the probability of a
sequence in which the first, second, ..., nth letters occur $N_1, N_2, \ldots, N_n$
times is $p_1^{N_1} p_2^{N_2} \cdots p_n^{N_n}$), it is clear that for large N all
'probable' sequences differ very little in their probabilities. In other words, we
have proved here the statement set in italics on p. 56; this statement determines
the main part of the notion of entropy in coding theory.


In view of the specific importance of the statement brought out, it makes
sense to dwell upon it slightly longer and derive one more simple proof of it.
In the foregoing, we based our arguments on the calculation of the total number
of N-letter sequences in which the frequencies of individual alphabet letters differ
little from the corresponding probabilities $p_1, p_2, \ldots, p_n$. In this connection,
it was also noted that the probabilities of all such sequences are close to each
other and for all practical purposes do not deviate from the probability
$p_1^{Np_1} p_2^{Np_2} \cdots p_n^{Np_n}$ of sequences in which $N_1 = Np_1, N_2 = Np_2, \ldots, N_n = Np_n$,


†The phrase 'has the order' implies here that in fact before $2^{NH}$ there may occur a certain
factor proportional to a finite power of N (that is, proportional to $2^{A \log N}$, where A is a
fixed number); clearly, when N is quite large, this factor is very much less than the basic factor
$2^{NH}$ and does not play an essential role. Note in this connection that in the derivation above
we have shown just that the number of 'probable' sequences does not exceed $(2\varepsilon)^n N^n 2^{NH}$. It
is, however, clear that this number is not less than the number of sequences containing the first,
second, ..., nth letter exactly $Np_1, Np_2, \ldots, Np_n$ times. It can be shown that the
last number differs from


$$\frac{1}{p_1^{Np_1} p_2^{Np_2} \cdots p_n^{Np_n}} = 2^{NH}$$


by no more than a factor of the order of a finite power of N. Thus, to within a factor of
the order of a finite power of N the number of 'probable' sequences coincides with the number $2^{NH}$.
i.e., the frequencies of occurrence of each of the n alphabet letters precisely coincide
with the probabilities $p_1, p_2, \ldots, p_n$. The preceding probability can obviously
be rewritten in the form


$$(2^{\log p_1})^{Np_1} (2^{\log p_2})^{Np_2} \cdots (2^{\log p_n})^{Np_n} = 2^{N(p_1 \log p_1 + p_2 \log p_2 + \cdots + p_n \log p_n)} = 2^{-NH}.$$


Since $H = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$ is a fixed finite number and
N is very large, it is clear that the probability $2^{-NH}$ is quite small. Let us now
note that the formula derived immediately implies the estimate required by us of
the total number of different 'probable' sequences. In fact, the probability sum
of all such sequences is quite close to unity (it differs from unity by just some
extremely small number); since the probability of the sum of incompatible events
is equal to the sum of the corresponding probabilities, it is clear that the total
number of the considered sequences must be close to unity divided by the probability
of an individual sequence, i.e., close to the number $2^{NH}$. Thus, the statement
we are interested in is proved if we can just show that in the collection of
all $n^N$ possible N-letter sequences it is possible to discard some collection of
'low-probability' sequences (whose probability sum for sufficiently large N can
be made as small as desired) so that all the remaining sequences have practically
the same probability $2^{-NH}$.
Now we can easily evaluate the probability of any sequence of N letters of an
n-letter alphabet (where the probabilities of the first, second, ..., nth letters are,
respectively, $p_1, p_2, \ldots, p_n$), if the N letters of the sequence are chosen
successively one after the other independently of those chosen previously. This
probability obviously equals the product $p_{i_1} p_{i_2} \cdots p_{i_N}$, where $i_1, i_2, \ldots, i_N$ are
the numbers of the successive letters of our sequence. Consequently, the logarithm
of this probability is given by the relation

$$\log p_{i_1} + \log p_{i_2} + \cdots + \log p_{i_N} = \frac{\log p_{i_1} + \log p_{i_2} + \cdots + \log p_{i_N}}{N} \times N.$$

But the variables $\log p_{i_1}, \log p_{i_2}, \ldots, \log p_{i_N}$ are all defined by the results of
experiments consisting of the choice of one of the alphabet letters. Hence
these are all random variables, which can take the n values $\log p_1, \log p_2, \ldots, \log p_n$
with probabilities $p_1, p_2, \ldots, p_n$, respectively. By applying the law of large
numbers proved on pp. 34-36 to such a random variable, we find that with a
probability, which for sufficiently large N can be considered as arbitrarily close
to unity, the arithmetic mean

$$\frac{\log p_{i_1} + \log p_{i_2} + \cdots + \log p_{i_N}}{N}$$

differs from

$$\text{m.v.} \log p = p_1 \log p_1 + p_2 \log p_2 + \cdots + p_n \log p_n = -H$$

by not more than a given very small number $\varepsilon$. But this also implies that among
all N-letter sequences it is possible to disregard some collection of
‘low-probability’ sequences of very small probability sum such that the probabilities
of all the remaining sequences are roughly the same and extremely close to
$2^{-HN}$ (more precisely, each of them lies between $2^{-(H+\varepsilon)N}$ and $2^{-(H-\varepsilon)N}$).
The last statement is also the one which we desired to prove.
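
The statement just proved can be checked numerically for a two-letter alphabet. The following is a minimal sketch in Python; the function name `typical_set`, the probability 0.2, the length N = 1000 and the tolerance `eps` are assumptions chosen only for illustration. It sums exactly over all N-letter sequences, grouped by the number of occurrences of the first letter.

```python
import math

def typical_set(p, N, eps):
    """Exact probability sum and size of the set of 'probable' N-letter sequences.

    Two-letter alphabet with probabilities p and 1 - p (0 < p < 1).  A sequence
    is counted as 'probable' if -(1/N) * log2(its probability) is within eps of H.
    """
    q = 1.0 - p
    H = -p * math.log2(p) - q * math.log2(q)
    mass, count = 0.0, 0
    for k in range(N + 1):                      # k = occurrences of the first letter
        log_prob = k * math.log2(p) + (N - k) * math.log2(q)
        if abs(-log_prob / N - H) <= eps:
            ways = math.comb(N, k)              # number of sequences with this composition
            mass += ways * 2.0 ** log_prob
            count += ways
    return H, mass, count

N = 1000
H, mass, count = typical_set(p=0.2, N=N, eps=0.05)
print(f"H = {H:.4f} bits")
print(f"probability sum of 'probable' sequences: {mass:.4f}")
print(f"log2(number of 'probable' sequences) / N: {math.log2(count) / N:.4f}")
```

The last printed value is necessarily within about `eps` of H, which is the numerical content of the claim that the number of ‘probable’ sequences is of the order $2^{HN}$.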

Let us further sketch briefly the role of the assumption specifying that the
successive letters of a message are chosen each time independently of all the
preceding letters. On pp. 161-62 it has been shown that the first proof of the
fundamental coding theorem considered does not in fact depend on the fulfilment of this
condition. However, in the general case of mutually dependent letters the value
of the entropy H of one letter must be replaced by the per-letter specific entropy

$$H_\infty = \lim_{N \to \infty} \frac{H^{(N)}}{N}$$

(where $H^{(N)}$ is the entropy of a block of N letters). Starting
from this it seems natural to expect that the second proof must in fact be
applicable also to the general case of a message with mutually dependent letters,
although in the course of this proof the assumption of independence of the letters
of a message is essentially used. In other words, it seems natural to expect that
even in the case of a message whose letters depend on each other, among all
N-letter sequences, where N is sufficiently large, one can extract a collection of
‘probable’ sequences, whose probability sum differs very little from unity, the
number of these probable sequences being of the order $2^{H_\infty N} \approx 2^{H^{(N)}}$ and the
probability of each of them being close to $2^{-H_\infty N} \approx 2^{-H^{(N)}}$. The statement just
formulated occupies a very important place in information theory; however, its proof
is not quite straightforward and, moreover, it cannot be obtained in general for
all cases without exception since it demands that the probability distributions
for successive letters of a message satisfy certain additional conditions. (These
additional conditions are of great variety and are always satisfied in
practice, but even their formulation entails the introduction of several quite new
and nonelementary probabilistic notions.) Note also that these additional
conditions can be chosen in different ways: thus, for one such condition the statement
made above was proved by Shannon ([21], Theorem 3), while later on entirely
different, quite general conditions for its validity were specified by McMillan [68].
We shall not further elaborate on this aspect and, instead, we refer the reader
to [8], [9], [11] and [23], in which the subject is analyzed in great detail.


All the preceding arguments of this section easily carry over also to the case 
of m-ary codes employing m elementary signals. Thus, (say) for constructing 
m-ary Shannon-Fano codes it is required only to partition groups of symbols 
not into two but into m parts of closest possible probability. Similarly, for 
constructing m-ary Huffman code it is necessary to use the contraction opera- 
tion of the alphabet, in which each time we combine not two but m letters of 
the original alphabet, having the lowest probabilities. In view of the importance 
of the Huffman code, we deal with the last question in slightly more detail. The 
contraction of an alphabet, in which m letters are replaced by one, clearly reduces
the number of letters by m — 1. The obvious prescription for the construction of 
m-ary codes is that the sequence of ‘contractions’ finally leads us to an alphabet 
of m letters (associated with m code signals), and hence it is necessary that the 
number n of original alphabet letters be represented in the form 


$$n = m + k(m - 1),$$


where k is an integer. This, however, can always be achieved by adding, if 
required, to the original alphabet a few ‘fictitious letters’, whose probabilities are 
considered to be zero. Then the construction of an m-ary Huffman code and 
the proof of its optimality (among all m-ary codes) are carried out in exactly 
the same way as in the case of a binary code. Thus, for instance, in the case of 
the 6-letter alphabet considered above, having the probabilities 0.4, 0.2, 0.2, 
0.1, 0.05 and 0.05, for the construction of a ternary Huffman code it is required 
to affix to our alphabet one additional fictitious letter of zero probability and 
act further as indicated in the accompanying table. 


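No. of     Original alphabet         Contracted alphabets
letter     Probability  Code word    Probability  Code word    Probability  Code word
1          0.4          0            0.4          0            0.4          0
2          0.2          2            0.2          2            0.4          1
3          0.2          10           0.2          10           0.2          2
4          0.1          11           0.1          11
5          0.05         120          0.1          12
6          0.05         121
7          0            —

(Letters 5, 6 and 7 are combined into the letter of probability 0.1 with code word 12; the
letters with code words 10, 11 and 12 are then combined into the letter of probability 0.4
with code word 1.)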
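
The contraction procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `huffman_mary` and the tie-breaking rule are assumptions, so the individual code words it produces may differ from those in the table, although the average code-word length is the same. Fictitious zero-probability letters are added until n = m + k(m - 1).

```python
import heapq
from itertools import count

def huffman_mary(probs, m):
    """Code words (strings of digits 0..m-1, m <= 10) of an m-ary Huffman code."""
    n = len(probs)
    # Pad with fictitious letters of zero probability so that n + pad = m + k(m - 1).
    pad = 0
    while n + pad < m or (n + pad - m) % (m - 1) != 0:
        pad += 1
    tie = count()                                # unique tie-breaker for the heap
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heap += [(0.0, next(tie), []) for _ in range(pad)]
    heapq.heapify(heap)
    codes = [""] * n
    while len(heap) > 1:
        # One 'contraction': combine the m letters of lowest probability.
        group = [heapq.heappop(heap) for _ in range(m)]
        merged, total = [], 0.0
        for digit, (p, _, letters) in enumerate(group):
            for i in letters:
                codes[i] = str(digit) + codes[i]  # prepend the new leading signal
            merged += letters
            total += p
        heapq.heappush(heap, (total, next(tie), merged))
    return codes

probs = [0.4, 0.2, 0.2, 0.1, 0.05, 0.05]
codes = huffman_mary(probs, 3)
average = sum(p * len(c) for p, c in zip(probs, codes))
print(codes, average)
```

For the 6-letter alphabet of the text the average length is 1.5 ternary signals per letter, in agreement with the code words of the table (0.4·1 + 0.2·1 + 0.2·2 + 0.1·2 + 0.05·3 + 0.05·3).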


Both proofs of the fundamental coding theorem deduced above carry over to
the case of m-ary codes in a straightforward manner. In particular, the
corresponding modification of the first proof is based on the fact that any n numbers
$l_1, l_2, \ldots, l_n$ satisfying the inequality

$$\frac{1}{m^{l_1}} + \frac{1}{m^{l_2}} + \cdots + \frac{1}{m^{l_n}} \le 1, \qquad (2)$$

form the code-word lengths of some m-ary code for an n-letter alphabet. The
proof of this fact is precisely a reiteration of the arguments deduced on pp. 159-60
for the case of m = 2; hence, we need not dwell upon it here. Using
inequality (2) in the same way as inequality (1) on p. 159, it is easy to obtain the
following result (called the fundamental coding theorem for m-ary codes): in any
coding method using an m-ary code, the average number of elementary signals
per letter of a message can never be less than the ratio H/log m (where H is
the entropy of one letter of the message); however, the former can always be made
as close as desired to the latter quantity, if sufficiently long N-letter ‘blocks’ are
coded directly and not the individual letters. Hence, it is clear that if L elementary signals
(taking m distinct values) can be transmitted through a communication channel
in unit time, then the information transmission rate over such a channel cannot be
greater than

$$v = \frac{L \log m}{H} \text{ letters/unit time};$$

the transmission at a rate as close as desired to v (but less than v!) is possible,
however. The variable

$$C = L \log m,$$


appearing in the numerator of the expression for v, depends only on the com- 
munication channel itself (while the denominator H characterizes the message 
to be transmitted). This variable defines the greatest amount of information 
units that can be transmitted over our channel in unit time (because one element- 
ary signal, as we know, can contain at most log m units of information); it is 
called the channel capacity. The notion of channel capacity occupies an import- 
ant place in communication theory; we shall come back to this later also (see 
Sections 4.3.6 (pp. 246-51) and 4.4). 
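
As a small worked illustration of the two formulas above, the following Python sketch evaluates C = L log m and v = C/H. The function names and the numerical values (50 binary signals per unit time, 4 bits per letter) are hypothetical and serve only to show the arithmetic.

```python
import math

def channel_capacity(L, m):
    """Capacity C = L log2(m), in bits per unit time."""
    return L * math.log2(m)

def max_letter_rate(L, m, H):
    """Upper bound v = C / H on the number of letters transmitted per unit time."""
    return channel_capacity(L, m) / H

# Hypothetical channel and message: 50 binary signals per unit time, H = 4 bits per letter.
L, m, H = 50, 2, 4.0
print(channel_capacity(L, m))      # capacity in bits per unit time
print(max_letter_rate(L, m, H))    # letters per unit time
```

With these assumed values the capacity is 50 bits per unit time, so no coding method can transmit more than 12.5 letters per unit time.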


We offer one more remark related to the first proof of the fundamental coding theorem
derived on p. 158 et seq. A central role in this proof is played by the existence of a binary
code in which the code-word length $l_i$ of the ith letter satisfies the inequalities

$$-\log p_i \le l_i < -\log p_i + 1, \qquad (A)$$

or, equivalently,

$$\frac{1}{2^{l_i}} \le p_i < \frac{1}{2^{l_i - 1}}. \qquad (B)$$

In the case of an arbitrary m-ary code these inequalities assume the form

$$-\frac{\log p_i}{\log m} \le l_i < -\frac{\log p_i}{\log m} + 1, \qquad (A')$$

or, equivalently,

$$\frac{1}{m^{l_i}} \le p_i < \frac{1}{m^{l_i - 1}}. \qquad (B')$$


The existence of a binary code satisfying (A) and (B) is proved above, relying on inequality 
(1) on p. 159 but the explicit expressions of the code words are not set out in the proof. In 
the case of an m-ary code, in exactly the same way, inequality (2) on p. 172 can be used. We 
now describe a method for the explicit construction of the corresponding code words. For
simplicity, we shall confine ourselves here to the case of a decimal code, associating some
sequences of digits 0, 1, ..., 9 with each of the n alphabet letters.† For such a decimal code,
the inequalities (A') and (B') obviously assume the form

$$-\lg p_i \le l_i < -\lg p_i + 1 \qquad (A'')$$

(where lg indicates common decimal logarithms!), and

$$\frac{1}{10^{l_i}} \le p_i < \frac{1}{10^{l_i - 1}}. \qquad (B'')$$

Arrange the whole ‘alphabet’ in the order of decreasing probabilities $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_n$.
Among these probabilities, we may obviously encounter even identical ones; hence the
probability by itself cannot uniquely characterize the corresponding letters. If, however, we set
up the sums

$$P_1 = 0, \quad P_2 = p_1, \quad P_3 = p_1 + p_2, \quad P_4 = p_1 + p_2 + p_3, \quad \ldots, \quad P_n = p_1 + p_2 + \cdots + p_{n-1},$$

then these sums are plainly all distinct. Thus, the n numbers $P_1, P_2, \ldots, P_n$ can be considered
as a distinctive ‘alphabet’, corresponding uniquely to the original n-letter alphabet. We are
now required only to encode the new ‘alphabet’, i.e., to associate a definite sequence of
elementary signals (or digits) with each of the n numbers $P_i$. Such coding simultaneously
solves the problem of coding the original alphabet.

It is not difficult to indicate a method for solving the problem of coding the number set
$P_1, P_2, \ldots, P_n$. Let us represent each of the numbers $P_i$ (less than unity!) in the form of a
(in general infinite) decimal fraction:

$$P_i = 0.a_1 a_2 a_3 \ldots a_k \ldots,$$

where $a_1, a_2, a_3, \ldots$ are any digits (if $P_i$ is expressed in the form of a finite decimal fraction,
then all digits $a_k$ from a certain digit onwards are zero). Every $P_i$ is, in turn, associated with
the infinite sequence $a_1 a_2 a_3 \ldots$ of digits (i.e., of elementary signals); here the n sequences of
digits so obtained are obviously all distinct because no two $P_i$ are equal to each other.

Now note that the distinction between the introduced sequences $a_1 a_2 a_3 \ldots$ cannot be
manifested only in digits which are quite far away from the initial digit. In fact, it is obvious that

$$P_{i+1} - P_i = p_i, \quad P_{i+2} - P_i = p_i + p_{i+1}, \quad \ldots$$

Hence, by inequality (B'') all numbers $P_{i+1}, P_{i+2}, \ldots, P_n$ differ from $P_i$ by not less than
$1/10^{l_i}$ and therefore the expansions of all these numbers into decimal fractions differ from the
decimal fraction expansion of the number $P_i$ in the $l_i$th, or even an earlier, digit. In
other words, all decimal fractions of $P_{i+1}, P_{i+2}, \ldots, P_n$ differ from the decimal fraction of $P_i$ in
at least one of the first $l_i$ digits. Hence, if we retain just the first $l_i$ digits in the decimal
expansion of $P_i$ (where i = 1, 2, ..., n), then we obtain n (finite!) decimal fractions, which
are all distinct and none of which is a prefix of another. The corresponding n sequences
$a_1 a_2 a_3 \ldots a_{l_i}$ of digits (associated with the n letters of the original alphabet) form the required
decimal code.

†The general case differs from this mainly in that it entails the expansion of the numbers $P_i$
introduced in the text into an (infinite) m-ary fraction, i.e., the representation of each number $P_i$ in
the form of the sum

$$P_i = \frac{a_1}{m} + \frac{a_2}{m^2} + \cdots + \frac{a_k}{m^k} + \cdots,$$

where all ‘digits’ $a_1, a_2, \ldots, a_k, \ldots$ in the formation of this fraction assume any of the values
0, 1, ..., m - 1. We recommend that the reader undertake the related construction as an
independent exercise.
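
The construction described above is easy to carry out mechanically. The sketch below, in Python, uses exact rational arithmetic; the function name `cumulative_code` and the choice of the 6-letter alphabet with probabilities 0.4, 0.2, 0.2, 0.1, 0.05, 0.05 as an example are assumptions made for illustration. All probabilities are assumed to be positive, and the base m is assumed not to exceed 10 so that each signal is a single digit.

```python
from fractions import Fraction

def cumulative_code(probs, m=10):
    """Code words obtained from the first l_i digits of the m-ary expansion of P_i.

    probs: positive probabilities given as strings or Fractions (e.g. "0.05").
    Returns a list of (probability, code word), in order of decreasing probability.
    """
    ps = sorted((Fraction(p) for p in probs), reverse=True)
    result = []
    P = Fraction(0)                        # P_1 = 0, P_{i+1} = P_i + p_i
    for p in ps:
        l = 0                              # smallest l with 1/m**l <= p
        while Fraction(1, m ** l) > p:
            l += 1
        digits, x = [], P
        for _ in range(l):                 # first l digits of the expansion of P
            x *= m
            d = x.numerator // x.denominator
            digits.append(str(d))
            x -= d
        result.append((p, "".join(digits)))
        P += p
    return result

for p, word in cumulative_code(["0.4", "0.2", "0.2", "0.1", "0.05", "0.05"]):
    print(float(p), word)
```

For this alphabet the sums $P_i$ are 0, 0.4, 0.6, 0.8, 0.9, 0.95 and the lengths $l_i$ are 1, 1, 1, 1, 2, 2, so the decimal code words are 0, 4, 6, 8, 90, 95; none of them is a prefix of another.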


It is shown above that any n numbers $l_1, l_2, \ldots, l_n$ such that

$$\frac{1}{m^{l_1}} + \frac{1}{m^{l_2}} + \cdots + \frac{1}{m^{l_n}} \le 1 \qquad (2)$$

form the code-word lengths of some m-ary code, which associates n letters of the alphabet with n
sequences of elementary signals, taking m possible values. Setting the corresponding arguments
in the converse order, it is straightforward to show also that the code-word lengths $l_1, l_2, \ldots, l_n$
of any m-ary code for an n-letter alphabet necessarily satisfy inequality (2). This has already
been established in effect at the end of the previous chapter (see pp. 135-36), albeit without
using the terminology of this chapter. Thus, it is necessary and sufficient that inequality (2) be
satisfied in order that the numbers $l_1, l_2, \ldots, l_n$ be able to form the code-word lengths of some
m-ary code. This statement was first proved in 1949 by the American scientist Kraft in his
unpublished dissertation (see, for example, [8] and [1]), and later it was further extended by
McMillan [69]; hence, inequality (2) is often called the Kraft inequality or McMillan inequality.
The generalization due to McMillan is connected with the circumstance that so far we have
considered only codes satisfying the general condition set in italics atop p. 141 (and termed them
instantaneous or instantaneously decodable in the footnote on the same page); it is only to these
codes that all the arguments deduced above are related. McMillan has shown, however, that
condition (2) is necessary and sufficient also for the existence of a uniquely decipherable (but not
necessarily instantaneous) m-ary code with code-word lengths $l_1, l_2, \ldots, l_n$. Since any
instantaneous code is at the same time also uniquely decipherable, it is obviously required to prove
only the necessity of the stated inequality for any uniquely decipherable code, i.e., the fact that
in the case of any uniquely decipherable m-ary code for an n-letter alphabet the code-word
lengths $l_1, l_2, \ldots, l_n$ necessarily satisfy inequality (2). The last statement has been proved in
a most straightforward manner by Karush [67], whose proof we shall also follow in our
presentation.

Denote by A the sum

$$A = \frac{1}{m^{l_1}} + \frac{1}{m^{l_2}} + \cdots + \frac{1}{m^{l_n}},$$

where $l_1, l_2, \ldots, l_n$ are the code-word lengths of some uniquely decipherable m-ary code
associated with the n-letter alphabet. Let us now set up the expression

$$A^t = \left(\frac{1}{m^{l_1}} + \frac{1}{m^{l_2}} + \cdots + \frac{1}{m^{l_n}}\right)^t = \underbrace{\left(\frac{1}{m^{l_1}} + \cdots + \frac{1}{m^{l_n}}\right)\left(\frac{1}{m^{l_1}} + \cdots + \frac{1}{m^{l_n}}\right) \cdots \left(\frac{1}{m^{l_1}} + \cdots + \frac{1}{m^{l_n}}\right)}_{t \text{ times}}.$$
Removing the parentheses in the last product, we obtain the sum of $n^t$ terms of the form
$1/m^N$, where each exponent N is equal to one of the sums of the form $l_{i_1} + l_{i_2} + \cdots + l_{i_t}$.
The numbers $i_1, i_2, \ldots, i_t$ take here the values 1, 2, ..., n and of course they need not be all
distinct. If it is assumed that the lengths of the n code words for a uniquely decipherable m-ary
code are so ordered that $1 \le l_1 \le l_2 \le \cdots \le l_n$, then the two inequalities

$$t \le N \le t l_n$$

hold for every sum

$$N = l_{i_1} + l_{i_2} + \cdots + l_{i_t}.$$

In fact, it is clear that N = t if $l_{i_1} = l_{i_2} = \cdots = l_{i_t} = 1$, and $N = t l_n$ if
$l_{i_1} = l_{i_2} = \cdots = l_{i_t} = l_n$. Now denote by $K_N$ the number of distinct sums
$l_{i_1} + l_{i_2} + \cdots + l_{i_t}$ (i.e., of distinct index sequences $i_1, i_2, \ldots, i_t$) taking the
value N. It is then easy to see that by removing the parentheses in the expression for $A^t$, we
get

$$A^t = \left(\frac{1}{m^{l_1}} + \frac{1}{m^{l_2}} + \cdots + \frac{1}{m^{l_n}}\right)^t = K_t \frac{1}{m^t} + K_{t+1} \frac{1}{m^{t+1}} + \cdots + K_{t l_n} \frac{1}{m^{t l_n}},$$

where, of course, some of the coefficients $K_t, K_{t+1}, \ldots, K_{t l_n}$ can take zero values. Now note
that the number $K_N$ of distinct sums $l_{i_1} + l_{i_2} + \cdots + l_{i_t}$ taking the value N is equal to the
number of distinct t-letter words $b_{i_1} b_{i_2} \ldots b_{i_t}$ (where $b_1, b_2, \ldots, b_n$ are our alphabet letters)
that are encoded by a sequence of N elementary signals. It is easy to show that

$$K_N \le m^N$$

for any uniquely decipherable code. Indeed, $m^N$ is the total number of distinct sequences of
N signals, each of which can take one of the m values, and if any two distinct words were
encoded by the same sequence of elementary signals, then this would imply that the code is not
uniquely decipherable. Hence for any (natural) t

$$A^t = K_t \frac{1}{m^t} + K_{t+1} \frac{1}{m^{t+1}} + \cdots + K_{t l_n} \frac{1}{m^{t l_n}} \le m^t \frac{1}{m^t} + m^{t+1} \frac{1}{m^{t+1}} + \cdots + m^{t l_n} \frac{1}{m^{t l_n}} = t l_n - (t - 1) \le t l_n.$$

But this also implies that

$$A \le 1$$

(i.e., inequality (2) holds!). In fact, for any A > 1 the variable $A^t$ increases with increasing
t faster than ct, where c is an arbitrary fixed number† (say, $l_n$), and hence for sufficiently large
t the inequality $A^t > l_n t$ is necessarily satisfied.

†Denote 1/t by p; then $\log (A^t) = t \log A = (\log A)/p$, and
$\log (ct) = \log c + \log t = \log c - \log p$.
It is clear that when p is small (i.e., when t is large) the first of these numbers is considerably
greater than the second, because log c is a constant number (independent of p) and log A > 0
(since A > 1), while the ratio $(-\log p) : [(\log A)/p] = (1/\log A)(-p \log p)$ vanishes as $p \to 0$
(see p. 47).

From the fact that for both instantaneous codes and arbitrary uniquely decipherable codes
the necessary and sufficient condition for the existence of a code with given code-word lengths
$l_1, l_2, \ldots, l_n$ has one and the same form (2), it follows that for any uniquely decipherable
m-ary code there exists an instantaneous code having the same code-word lengths as the
original uniquely decipherable code. But this in turn implies in particular that Huffman
codes are optimal (i.e., have the least average code-word length per letter) not only among all
instantaneous codes (this fact is shown on pp. 156-158; see also p. 172), but also in general
among all uniquely decipherable codes.
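
The passage from a uniquely decipherable code to an instantaneous code with the same code-word lengths can be illustrated by a short Python sketch. The function names and the example are assumptions introduced here: the binary code 0, 01, 011, 111 is uniquely decipherable but not instantaneous (0 is a prefix of 01), and its lengths 1, 2, 3, 3 satisfy inequality (2). The construction assigns code words in order of increasing length, in the spirit of the cumulative-sum method described earlier; m is assumed not to exceed 10.

```python
from fractions import Fraction

def kraft_sum(lengths, m=2):
    """Left-hand side of inequality (2), computed exactly."""
    return sum(Fraction(1, m ** l) for l in lengths)

def instantaneous_code(lengths, m=2):
    """An instantaneous m-ary code with the given code-word lengths (m <= 10)."""
    if kraft_sum(lengths, m) > 1:
        raise ValueError("inequality (2) is violated: no such code exists")
    codes = [None] * len(lengths)
    value, prev_len = 0, 0                 # next free code word, read as a base-m integer
    for i in sorted(range(len(lengths)), key=lambda j: lengths[j]):
        l = lengths[i]
        value *= m ** (l - prev_len)       # extend the word to the new length
        digits, v = [], value
        for _ in range(l):                 # write value with exactly l base-m digits
            digits.append(str(v % m))
            v //= m
        codes[i] = "".join(reversed(digits))
        value += 1
        prev_len = l
    return codes

lengths = [len(w) for w in ["0", "01", "011", "111"]]
print(kraft_sum(lengths))            # 1
print(instantaneous_code(lengths))   # ['0', '10', '110', '111']
```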


4.3. Entropy and information of various messages encountered in practice 


The preceding two sections were devoted to the problem of the coding and
transmission of an abstract ‘message’ written in any ‘language’ whose alphabet
consists of n letters. We shall now discuss the conclusions that can be derived in
relation to specific types of messages used in human communications, in the
first place messages expressed in the English language or in some foreign
languages. There exists extensive literature on this subject (see, for example, [1],
[5], [6], [17], [147], [148], [173] and [174], which will be reviewed only partly
below).


4.3.1. Written Language 


The basic result of Sec. 4.1 is related to the transmission of an M-letter message
(where M is sufficiently large) over a communication channel admitting m
distinct elementary signals. The result states that for such transmission it is
necessary to send not less than M log n/log m signals, where n is the number of
different ‘alphabet’ letters by means of which the message is written; moreover, there
exists a coding method which enables us to approach as closely as desired the
indicated bound of M log n/log m signals. Since the English ‘telegraphic’
alphabet contains 27 letters (the 26 ordinary English letters and also the ‘zero letter’,
the space between words), for the transmission of an M-letter message
composed of English words it is necessary to send

$$M \frac{\log 27}{\log m} = M \frac{H_0}{\log m}$$

elementary signals. Here

$$H_0 = \log 27 \approx 4.75 \text{ bits}$$

is the entropy of an experiment that consists of receiving one letter of the English
text (the information contained in one letter), subject to the condition that all
letters are considered equally probable.

In real life, however, the appearance of different letters in an English language
text is far from being equally probable. Thus, for instance, in any text the letters
E and T occur more frequently than Q or J; since the average word length in
the English language is considerably less than 26 letters, the probability of the
occurrence of a space (‘zero letter’) by far exceeds the value 1/27, which we
would have obtained if all 27 letters were equally probable. Hence, the
information contained in one letter of any intelligible English text is always less than
log 27 (≈ 4.75 bits). This implies that it is impossible to produce a text
composed of English letters, in which each letter contains log 27 bits of information,
by just taking an excerpt from some English book. To achieve this, it is
necessary to write the 27 letters on separate cards, place all these cards in an urn and
then draw them one by one, each time writing down the letter drawn, replacing
the card in the urn and mixing well the contents of the urn. Carrying out
such an experiment, we arrive at a ‘sentence’ which looks as follows (cf.
Shannon [21]):


XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD 
QPAAMKBZAACIBZLHJQD 


This text may be called a ‘zero-order letter approximation’ to English. Though 
it is made up of English letters, it has obviously little in common with the English 
language. 

For a more accurate calculation of the information contained in one letter of
an English text, it is necessary to know the probabilities of the occurrence of
different letters. These probabilities can be determined approximately by taking
a sufficiently large excerpt written in English and calculating for it the relative
frequencies of individual letters. Strictly speaking, these frequencies may depend
also upon the character of the text and the peculiarities of the style of the
individual author. For example, it seems plausible that in some scientific books
the frequencies of individual letters undergo a change due to the appearance of
many special terms and foreign words. Even greater deviations from usual
letter frequencies can be found sometimes in poetry,† or in some refined works
of fiction. A striking example of the latter is provided in [17, Chap. 3]: it is related
to the 267-page novel Gadsby by the American author Ernest Vincent Wright
published in 1939, which does not contain anywhere the letter E (ordinarily the
most frequently used letter in the English alphabet!). Some other peculiar examples
of this sort related to German and Portuguese literature are also listed in
[17]. Hence for the reliable determination of the ‘average frequency’ of letters it
is desirable to have a collection of different texts taken from different sources.
As a rule, however, the deviation from the ‘normal letter frequencies’ is
nevertheless comparatively small and it can be ignored in a first approximation.
Approximate values of the frequencies of individual English letters are listed in
the accompanying table (see, for example, Shannon [159], Quastler in [19], Pierce
[17], Abramson [1], Reza [152], Pratt [149], which contain slightly different
numerical data; the space between words is denoted here by a dash).

†Let us, for instance, mention the poem ‘Rush’ by the Russian poet K. D. Balmont. In this
poem, the rustling of rushes is described by the repeated appearances of the (usually quite
infrequent) Russian hissing letters Ш (sh) and Ч (ch).


TABLE

Letter   Relative frequency      Letter   Relative frequency
—        0.182                   m        0.021
e        0.107                   u        0.020
t        0.086                   g        0.016
a        0.067                   y        0.016
o        0.065                   p        0.016
n        0.058                   w        0.013
r        0.056                   b        0.012
i        0.052                   v        0.007
s        0.050                   k        0.003
h        0.043                   x        0.001
d        0.031                   j        0.001
l        0.028                   q        0.001
f        0.024                   z        0.001
c        0.023


Equating these frequencies to the probabilities of the occurrence of the
corresponding letters, the approximate value† of the entropy of the English text letter
is given by

$$H_1 = H(\alpha_1) = -0.182 \log 0.182 - 0.107 \log 0.107 - 0.086 \log 0.086 - \cdots - 0.001 \log 0.001 \approx 4.03 \text{ bits}.$$

From a comparison of this value with $H_0 = \log 27 \approx 4.75$ bits it is seen that
the irregularity in the occurrence of different letters of the alphabet leads to a
reduction in the information contained in one English text letter by roughly 0.72
bit.

†Since the values of the frequencies of individual letters in an excerpt containing a finite
number N of letters do not coincide with the corresponding probabilities, it is clear that the
value of the entropy obtained by substituting frequencies for probabilities is not exact. An
estimation of the accuracy of the values of the entropies thus obtained and the corrections
to these values required when N is not large enough have been considered, for instance, by Basharin
in [74] and Miller in [19, pp. 95-100]. See also Blyth [79], Pfaffelhuber [144] and Nemetz
[133].
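
The two entropies compared here can be recomputed directly from the frequency table. The Python sketch below is only illustrative; the names `LETTERS`, `PER_MILLE` and `entropy` are assumptions, and the frequencies are entered in thousandths, with the space written as the first character of `LETTERS`. Because the table entries are rounded to three decimal places, the value obtained for $H_1$ may differ slightly in the second decimal place from the figure quoted in the text.

```python
import math

# Letter frequencies of the table, in thousandths (the first character is the space).
LETTERS = " etaonrishdlfcmugypwbvkxjqz"
PER_MILLE = [182, 107, 86, 67, 65, 58, 56, 52, 50, 43, 31, 28, 24, 23,
             21, 20, 16, 16, 16, 13, 12, 7, 3, 1, 1, 1, 1]
FREQ = {ch: k / 1000 for ch, k in zip(LETTERS, PER_MILLE)}

def entropy(probs):
    """Entropy in bits of a finite probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H0 = math.log2(len(FREQ))          # all 27 letters equally probable
H1 = entropy(FREQ.values())        # letters with the tabulated frequencies
print(f"H0 = {H0:.2f} bits, H1 = {H1:.2f} bits, reduction = {H0 - H1:.2f} bit")
```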

Making use of this fact, we can reduce the number of elementary signals
required for the transmission of an English M-letter message to the value
$M(H_1/\log m)$ (i.e., in the case of a binary code to the value $H_1 M = 4.03 M$).
A reduction in the required number of elementary signals can be achieved by
coding individual English alphabet letters by the Shannon-Fano method (see
p. 150 et seq.). It is not difficult to verify that the application of this method
leads to the accompanying table of code words.


TABLE

Letter   Code word      Letter   Code word      Letter   Code word
—        111            i        0101           r        0110
a        1001           j        0000000010     s        01001
b        000001         k        00000001       t        101
c        001001         l        00110          u        00011
d        00111          m        001000         v        0000001
e        110            n        ...            w        000010
f        ...            o        1000           x        ...
g        ...            p        ...            y        ...
h        ...            q        ...            z        ...


The average number of elementary signals required for the transmission of
one letter of a message under such a coding method is given by

$$0.375 \times 3 + 0.298 \times 4 + 0.196 \times 5 + 0.117 \times 6 + 0.007 \times 7 + 0.003 \times 8 + 0.004 \times 10 = 4.11,$$

i.e., it is considerably lower than $H_0 \approx 4.75$ and does not differ much from
$H_1 = 4.03$.†
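
The Shannon-Fano procedure and the average computed above can be reproduced with a short Python sketch. The function name `shannon_fano`, the use of integer frequencies in thousandths, and the tie-breaking rule (the first of several equally balanced cuts is taken) are assumptions of this sketch; with them the result agrees with code words such as 111 for the space and 101 for t given in the table.

```python
# Letter frequencies in thousandths, in order of decreasing frequency (first character: space).
LETTERS = " etaonrishdlfcmugypwbvkxjqz"
PER_MILLE = [182, 107, 86, 67, 65, 58, 56, 52, 50, 43, 31, 28, 24, 23,
             21, 20, 16, 16, 16, 13, 12, 7, 3, 1, 1, 1, 1]

def shannon_fano(weighted):
    """weighted: list of (letter, weight) in decreasing order of weight."""
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(w for _, w in group)
        running, best_cut, best_diff = 0, 1, None
        for cut in range(1, len(group)):
            running += group[cut - 1][1]
            diff = abs(total - 2 * running)      # imbalance between the two parts
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = cut, diff
        split(group[:best_cut], prefix + "1")    # more probable part receives signal 1
        split(group[best_cut:], prefix + "0")

    split(list(weighted), "")
    return codes

codes = shannon_fano(list(zip(LETTERS, PER_MILLE)))
average = sum(w * len(codes[ch]) for ch, w in zip(LETTERS, PER_MILLE)) / 1000
print(f"average number of binary signals per letter: {average:.3f}")
for ch in LETTERS:
    print(repr(ch), codes[ch])
```

Integer weights are used so that the comparison of the two parts is exact; the printed average can be compared with the value 4.11 obtained in the text.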


†Besides, it is rather difficult to decipher a message encoded by such a method, and this
renders this code of little practical value. The difficulty in deciphering can be verified, for
instance, by attempting to decode the following ‘sentence’:

101010001 1011 10100101000100101 11011) 100001 111110010110010111 1000111001001 100000 

11111011101010100111 1001 1 10101001010010101010010010001 1001 10101 1111011000 
11100111110001001 1000001 11.110 


(Decoding is facilitated appreciably if we set up beforehand all code words in the order of 
decreasing probabilities of corresponding letters.) 
The average number of elementary signals per letter of a message to be
transmitted, even when it equals the value $H_1/\log m$, is not the best, however. In
fact, in defining the entropy $H_1 = H(\alpha_1)$ of experiment $\alpha_1$, consisting of
determining one letter of an English text, we had considered all letters to be
independent. This means that for making up a ‘text’ in which every letter contains
$H_1 = 4.03$ bits of information, we must use an urn containing 1,000 well-mixed
tickets, of which nothing is written on 182, the letter e is written on 107, the
letter t on 86, ..., and, finally, on 1 ticket the letter z is written (see the frequency
table of English letters on p. 179). By drawing the tickets from this urn one by
one we may arrive at a ‘sentence’ that looks like the following:†


OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI 
ALHENHTTPA OOBTTVA NAN BRL 


This ‘first-order letter approximation’ to English is somewhat more akin to
intelligible written English than its predecessor (we observe here at least a
plausible distribution of the number of vowels and consonants and the mean word
length is close to the average word length of the English language), but it is
obviously still far from being a reasonable text.
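
Both urn experiments described so far are easy to imitate on a computer. In the Python sketch below the function names, the seed and the output length are assumptions made for illustration; `zero_order` draws the 27 letters with equal probabilities, while `first_order` imitates the urn with 1,000 tickets by drawing letters with the tabulated frequencies, each ticket being replaced after it is drawn.

```python
import random
import string

ALPHABET = " " + string.ascii_uppercase          # the 27-letter 'telegraphic' alphabet

# Numbers of tickets per letter in the urn with 1,000 tickets (first character: space).
LETTERS = " ETAONRISHDLFCMUGYPWBVKXJQZ"
TICKETS = [182, 107, 86, 67, 65, 58, 56, 52, 50, 43, 31, 28, 24, 23,
           21, 20, 16, 16, 16, 13, 12, 7, 3, 1, 1, 1, 1]

def zero_order(length, rng):
    """Zero-order approximation: all 27 letters equally probable."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def first_order(length, rng):
    """First-order approximation: independent letters with the tabulated frequencies."""
    return "".join(rng.choices(LETTERS, weights=TICKETS, k=length))

rng = random.Random(1948)                        # arbitrary seed, for reproducibility only
print(zero_order(60, rng))
print(first_order(60, rng))
```

The particular ‘sentences’ printed depend on the seed and will not coincide with those quoted in the text; only their statistical character is the same.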

†See Shannon [21] (cf. also Dobrushin [91]). As explained in these papers, instead of
drawing from an urn with 1,000 tickets we can undertake a considerably easier procedure, viz. we
take any English book and choose from it a series of letters at random.

The dissimilarity of our sentence from an intelligible text is naturally explained
by the fact that in real life the successive letters of an English text are not at all
independent of each other. Thus, for example, the letter Q in English is always
followed by U (so that the combinations QA, QB or QX, for example, have a
zero probability); T is most often followed by H (TH occurs most frequently of
all two-letter combinations or digrams in the English language); similarly, the
letters O and W are most often followed by R and E, respectively; the
probability of the occurrence of a vowel after a consonant is significantly higher than
the probability of its occurrence after another vowel, and so on. The existence
of such auxiliary regularities in the English language, for which no allowance is
made in our ‘sentence’, leads to a further reduction in the amount of uncertainty
(entropy) of one letter of the English text. Hence, in the transmission of such
a text over a communication channel, we can still reduce the average number of
elementary signals required to transmit one letter. It is not difficult to
comprehend how this reduction can be characterized numerically. For this it is
necessary only to calculate the conditional entropy $H_2 = H_{\alpha_1}(\alpha_2)$ of experiment $\alpha_2$
that consists of the determination of one letter of the English text, given that we
know the outcome of experiment $\alpha_1$ that consists of the determination of the
preceding letter of the same text. (Note that when the next letter of a message
is received, we always know already the preceding letter.) By what has been
stated on pp. 62-63, the conditional entropy $H_2$ is defined by the formula

$$H_2 = H_{\alpha_1}(\alpha_2) = H(\alpha_1 \alpha_2) - H(\alpha_1)$$
$$= -p(\text{— —}) \log p(\text{— —}) - p(\text{— a}) \log p(\text{— a}) - p(\text{— b}) \log p(\text{— b}) - \cdots - p(zz) \log p(zz)$$
$$+ p(\text{—}) \log p(\text{—}) + p(a) \log p(a) + p(b) \log p(b) + \cdots + p(z) \log p(z),$$

where we denote by p(—), p(a), p(b), ..., p(z) the probabilities (frequencies) of
individual letters of the English language (their values are indicated on p. 179),
and by p(— —), p(— a), p(— b), ..., p(zz) the probabilities (frequencies) of all
possible digrams, i.e., two-letter combinations. Probability tables of such
digrams in English texts, computed for the purpose of cryptanalysis (i.e., for
deciphering encoded messages), are available (see, for example, Pratt [149]).
For an approximate determination of such ‘digram probabilities’ it is only
necessary to calculate the frequencies of the appearance of different
combinations of two adjoining letters in any sufficiently long English excerpt; in doing
so, it is obviously possible to assume in advance that the probabilities p(— —),
p(qa) and a series of others (say, p(xx), p(jj), p(qx) and so on) are zero. The
numerical value of $H_2$ will be given later in this section. Here we only
emphasize that, by virtue of the results of Section 2.2, we can be convinced of the fact
that the value of the conditional entropy $H_2 = H_{\alpha_1}(\alpha_2)$ is less than that of the
unconditional entropy $H_1$.
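
The calculation of the conditional entropy from digram frequencies can be sketched as follows in Python. The function names and the short sample sentence are assumptions made only for illustration, and a realistic estimate would of course require a long excerpt. To make the letter frequencies and the digram frequencies consistent, the sample is treated as cyclic, i.e., its last letter is regarded as being followed by its first.

```python
import math
from collections import Counter

def entropy(counter):
    """Entropy in bits of the empirical distribution given by a Counter."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def letter_and_conditional_entropy(text):
    """Return (H1, H2): the entropy of one letter and the conditional entropy
    of a letter given the preceding one, estimated from the sample text."""
    cyclic = text + text[0]                      # last letter is followed by the first
    letters = Counter(text)
    digrams = Counter(zip(cyclic, cyclic[1:]))
    H1 = entropy(letters)
    H2 = entropy(digrams) - H1                   # H(a1 a2) - H(a1)
    return H1, H2

sample = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG AND THEN THE DOG RUNS AFTER THE FOX "
H1, H2 = letter_and_conditional_entropy(sample)
print(f"H1 = {H1:.2f} bits, H2 = {H2:.2f} bits")
```

For any sample the value of H2 obtained in this way does not exceed H1. For so short a sample H2 is much smaller than for real English text, because most digrams occur only once.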

The quantity H, can be described as the ‘average information’ contained in 
the definition of the outcome of the following experiment. Let us assume that 
there are 27 urns to denote 27 letters of the English alphabet and that each urn 
contains tickets on which are written different digrams (i.e., two-letter comb- 
inations) starting with the letter denoting the urn. Suppose that the number of 
tickets in the urn with a specific digram is proportional to the frequency (prob- 
ability) of the corresponding digram. The experiment consists of repeatedly 
drawing tickets from the urns and writing out the last letters obtained from 
them. In this process, each time (starting with the second one) the tickets are 
drawn from that urn which contains the digram beginning with the last letter 
written out; after the letter is noted, the ticket is replaced in the urn from which 
it was drawn and the urn contents are thoroughly shuffled. (Instead of urns, 
it is also possible to make use of any English book, starting each time with a 
randomly chosen place, to seek the first appearance of the last letter chosen by 
us and add the letter that follows it to the already existing text. It is clear that
such a book experiment is much easier to perform than the corresponding urn
experiment.) An experiment of this sort leads to a ‘sentence’ that looks like the
following:


ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY 
ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO 
TIZIN ANDY TOBE SEACE CTISBE

Phonetically, this ‘second-order letter approximation’ is appreciably closer to
the English language than its predecessor set forth on p. 181 (for instance, here 
we have not only a plausible correlation between the numbers of vowel and con- 
sonant letters but are also nearer to their usual alternation sequence, because of 
which the sentence can be ‘pronounced’, though not without difficulty). We 
may also point out that there are several ‘genuine’ English words in this sentence 
(e.g., ON, ARE, BE, AT), but the previous example contains no such word. 
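
The urn experiment is easy to imitate on a computer. The following is only a minimal sketch under stated assumptions: the names `normalize` and `letter_approximation` are hypothetical, the ‘book’ is a toy string, and each new letter is drawn uniformly among all occurrences of the preceding letters in the book (which corresponds to the urn scheme rather than to the ‘first appearance after a random place’ rule of the book experiment).

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space

def normalize(text):
    """Lower-case the text, map every other symbol to a space, squeeze spaces."""
    kept = "".join(c if c in ALPHABET else " " for c in text.lower())
    return " ".join(kept.split())

def letter_approximation(book, order, length, rng):
    """Build a letter approximation of the given order from `book`.

    Each new letter is drawn uniformly from the letters that follow the
    occurrences of the last (order - 1) letters in `book` (the urn scheme).
    """
    k = order - 1
    out = ""
    while len(out) < length:
        context = out[max(0, len(out) - k):]
        followers = [book[i + len(context)]
                     for i in range(len(book) - len(context))
                     if book.startswith(context, i)]
        if not followers:           # context found only at the very end of the book
            followers = list(book)  # assumption: fall back to single-letter frequencies
        out += rng.choice(followers)
    return out

# Toy 'book'; a real experiment needs a long English text.
book = normalize("The quick brown fox jumps over the lazy dog. The dog barks; the fox runs.")
print(letter_approximation(book, order=2, length=40, rng=random.Random(0)))
```

With `order=1` the letters are drawn independently with their own frequencies, with `order=2` the digram statistics are respected, and so on; the linear scan of the book at every step is chosen for clarity, not speed.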

Thus, it is obvious that the quantity $H_2/\log m$ also fails to yield the best possible
estimate of the minimum value of the average number of elementary signals
required for the transmission of one letter of the English text. The fact is that
in the English language (and for that matter in any other language) each letter
depends not only on the letter that immediately precedes it but also on a series of
preceding letters. For instance, it is known that the three-letter combination
(trigram) THE quite frequently appears in the English language (and even the
five-letter combination —THE— is rather probable), but the trigram THA is practically
impossible. It is also known that after two consonants a vowel follows
much more frequently than a third consonant, that after two vowels a consonant
is almost obligatory, and so on. Hence, the knowledge of two preceding letters
reduces still further the uncertainty of the event consisting of the determination
of the succeeding letter, which is revealed in the difference $H_2 - H_3$ being positive,
where $H_3$ is the ‘conditional second-order entropy’ defined by


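\[
\begin{aligned}
H_3 = H_{\alpha_1\alpha_2}(\alpha_3) &= H(\alpha_1\alpha_2\alpha_3) - H(\alpha_1\alpha_2) \\
&= -p(\text{— — —})\log p(\text{— — —}) - p(\text{— — a})\log p(\text{— — a}) - \dots - p(zzz)\log p(zzz) \\
&\quad + p(\text{— —})\log p(\text{— —}) + p(\text{— a})\log p(\text{— a}) + \dots + p(zz)\log p(zz).
\end{aligned}
\]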


For the probabilities of trigrams in English texts see, for example, Pratt [149];
the corresponding value of $H_3$ will be given below.

An intuitive corroboration of what has been stated is provided by the following
experiment. Cards with three-letter combinations are drawn from $27^2 = 729$ urns,
each of which contains cards with different trigrams starting with one and the same
digram (equivalently, an English book is used, in which repeated efforts are made
to select at random the digram coinciding with the last two letters chosen beforehand
and to write down the letter appearing after it). Such an experiment leads to a
‘sentence’ such as the following:


IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID 
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS 
REGOACTIONA OF CRE 


This sentence represents a ‘third-order approximation’ to English. It is also closer
to the English language than its predecessors; it contains eight ‘genuine’ English
words and several easily pronounceable English-sounding words (e.g., ‘DEMONSTURES’).
In analogy to this, we can also determine the entropy


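\[
\begin{aligned}
H_4 = H_{\alpha_1\alpha_2\alpha_3}(\alpha_4) &= H(\alpha_1\alpha_2\alpha_3\alpha_4) - H(\alpha_1\alpha_2\alpha_3) \\
&= -p(\text{— — — —})\log p(\text{— — — —}) - p(\text{— — — a})\log p(\text{— — — a}) \\
&\quad - \dots - p(zzzz)\log p(zzzz) + p(\text{— — —})\log p(\text{— — —}) \\
&\quad + p(\text{— — a})\log p(\text{— — a}) + \dots + p(zzz)\log p(zzz)
\end{aligned}
\]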


that corresponds to an experiment to determine the next letter of an English 
text, provided that the three preceding letters are known. Corresponding to this 
quantity, an experiment that consists of drawing cards from $27^3$ urns with four-letter
combinations (or an experiment with an English book similar to the one
described above) leads to a ‘sentence’, which would contain mostly genuine 
English or English-like words. A still better approximation to the entropy of 
letters of an intelligible English text is given by the quantities 


\[
H_N = H_{\alpha_1\alpha_2\dots\alpha_{N-1}}(\alpha_N) = H(\alpha_1\alpha_2\dots\alpha_N) - H(\alpha_1\alpha_2\dots\alpha_{N-1})
\]


when $N = 5, 6, \dots$ It is easy to see that with the growth of $N$ the entropy $H_N$
can only decrease (see p. 91). If it is further noted that all the $H_N$ are positive,
then from this it can be deduced that the quantities
$H_{\alpha_1\alpha_2\dots\alpha_{N-1}}(\alpha_N) = H_N$ tend to a definite limit
$H_\infty$ as $N \to \infty$. This limit coincides with the limit $H_\infty$
described in the preceding section (see p. 162).
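
The definition of $H_N$ as a difference of block entropies can be tried out directly on a sample text. The sketch below is an illustration only: the function names are hypothetical, the probabilities are replaced by frequencies in a short toy string, and the text is treated as cyclic (an assumption made here so that the N-letter and (N−1)-letter frequencies are mutually consistent and the computed values cannot increase with N).

```python
import math
from collections import Counter

def block_entropy(text, n):
    """Entropy (in bits) of the frequencies of n-letter blocks; 0 for n = 0."""
    if n == 0:
        return 0.0
    wrapped = text + text[:n - 1]  # cyclic continuation of the text
    counts = Counter(wrapped[i:i + n] for i in range(len(text)))
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text, n):
    """H_N = H(N-letter block) - H((N-1)-letter block)."""
    return block_entropy(text, n) - block_entropy(text, n - 1)

sample = "the quick brown fox jumps over the lazy dog and the dog sleeps "
for n in range(1, 5):
    print(n, round(conditional_entropy(sample, n), 3))
```

For a text this short the values fall off far too quickly with N, since most long blocks occur only once; reliable estimates of $H_N$ need a sample much longer than the number of possible N-letter blocks.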


†The equality of the limit

\[
\lim_{N\to\infty} \frac{H^{(N)}}{N} = \lim_{N\to\infty} \frac{H(\alpha_1) + H_{\alpha_1}(\alpha_2) + \dots + H_{\alpha_1\dots\alpha_{N-1}}(\alpha_N)}{N}
\]

considered in Section 4.2 to the quantity $H_\infty$ introduced here follows from the fact that for
large $N$ almost all terms in the numerator of the fraction $H^{(N)}/N$ are close to

\[
H_\infty = \lim_{N\to\infty} H_{\alpha_1\alpha_2\dots\alpha_{N-1}}(\alpha_N);
\]

the only exception is the first few terms, whose contribution to the total sum is insignificant
when $N$ is large.

Thus, the sequence of ‘specific entropies’ $h_N = H^{(N)}/N$ as well as that of ‘conditional
entropies’

\[
H_N = H_{\alpha_1\alpha_2\dots\alpha_{N-1}}(\alpha_N)
\]

converge to one and the same limit $H_\infty$ as $N \to \infty$. Also, $h_1 = H_1 = H(\alpha_1)$, but $H_N < h_N$
when $N > 1$ (since $h_N$ equals the arithmetic average of $N$ numbers, only the last of which is
equal to $H_N$, while the rest are greater than $H_N$). Hence the quantities $H_N$, $N = 1, 2, 3, \dots$,
approach the limit value $H_\infty$ appreciably more rapidly than the quantities $h_N$ (cf. footnote on
p. 239).

From the results of Section 4.2 it follows that the average number of elementary
signals required for the transmission of one letter of an English text cannot be less
than $H_\infty/\log m$; on the other hand, a coding is possible for which this average
number is arbitrarily close to the quantity $H_\infty/\log m$ (see p. 162). The difference
$R = 1 - (H_\infty/H_0)$, expressing how much less than unity is the ratio of the ‘limit
entropy’ $H_\infty$ to the quantity $H_0 = \log n$, the latter characterizing the greatest
amount of information that can be contained in one letter of an alphabet with
a given number $n$ of letters, was designated by Shannon as the redundancy of a
language (English in the case under consideration). The data, of which we shall
speak below, compel us to assume that the redundancy of the English language
(as also that of other European languages) appreciably exceeds 50%. Without
claiming precision, we can say that the choice of the succeeding letter of an
intelligible text is determined in more than 50% of cases by the very structure of
the language and, consequently, randomness is involved only to a comparatively
small extent. It is specifically the redundancy of a language that enables us to
contract the telegraphic language by discarding some words (articles, prepositions
and conjunctions) that are easy to guess; it also allows us to reconstruct easily
the true text even in the presence of a considerable number of errors in a telegram
or misprints in a book.

In order to make clear the meaning of the quantity $R$, assume that an English
text is encoded with the aid of a 27-ary code in which the same English letters
are elementary signals. Such a ‘code’ is a certain method of shorthand writing
of an English sentence by means of ordinary letters. In the case of a most efficient
coding, for writing an $M$-letter message we require on the average

\[
\frac{H_\infty}{H_0}\, M = (1 - R)M
\]

elementary signals (letters), i.e., in comparison to an ordinary written text we
are able to economise by $RM$ letters. This conclusion obviously does not imply
that we can arbitrarily discard $RM$ letters and that the remaining $(1 - R)M$
letters would then suffice to reproduce the original message without error. In fact,
for contracting a message by $RM$ letters it is necessary to use a special ‘very best’
coding method, on applying which all letters of a message become independent
and equally probable. Hence, it is clear that a text so encoded will have the
same character as the ‘sentence’ on p. 178, i.e., it will seem to be completely
meaningless; it will be much more difficult to ‘read’ such a text than the ‘sentence’
given in the footnote on p. 180 (since the code words now correspond not to
individual letters but directly to very lengthy ‘blocks’ of letters). We further
note that in such a coding any error will be ‘fatal’: when decoded it will give
us a new meaningful text and either we do not notice this or, even if we do,
we cannot make out what was actually written. As regards contracting a text
by means of direct omission of a part of the letters that are chosen at random,
we can say in advance only that when more than $RM$ letters are rejected we cannot,
a fortiori, reproduce the original text without error. Specific experiments on
the reconstruction of missing letters of English text have shown that usually a
faultless reproduction is effected only if the number of discarded letters does
not exceed 25% of their total number.

For the English language in particular we have redundancy estimates that are
better than those for any other language. But even these estimates do not provide
any especially reliable data. Clearly, the problem of redundancy estimation
is equivalent to the problem of estimating the value $H_\infty$. But then how is the
latter quantity to be determined? Using the digram and trigram probabilities of the
English language due to Pratt [149], Shannon [159] calculated the values of $H_2$
and $H_3$. But it is clear that even $H_3$ is still far from $H_\infty$. To obtain
further estimates Shannon utilized the fact that different English words also have
different probabilities of occurrence in a meaningful English text. The English
word probabilities (estimated by frequency counts in a sufficiently lengthy sample
of ‘typical English text’) are given in special frequency dictionaries of the English
language (see, e.g., Eldridge [92], Dewey [90] or Thorndike [167]; cf. also [70]).
The data in various frequency dictionaries are in satisfactory agreement. They
show, for example, that THE is the most frequently used English word (its probability
is close to 0.071); the next most probable word is OF, followed by AND,
TO and so on. It is a remarkable fact that the probability $p_n$ of the appearance
of the $n$th word (in the decreasing order of word probabilities) is close to $0.1/n$
for quite a large number (a few thousand, in fact) of the most probable words
(this result will be considered below in greater detail).

Using Dewey’s frequency dictionary, Shannon [21] constructed an example of 
the so-called ‘first-order word approximation’ to the English language, i.e., of 
a sequence of genuine English words in which words are selected independently 
but with true probabilities of their appearance in the English text. This example 
is given below:


REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE
TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE


It is clear that this sequence of English words represents a completely sense- 
less English text. 

Making use of more comprehensive data on the statistical characteristics of
the written English language, Shannon also constructed a ‘second-order word
approximation’, in which every word is selected in accordance with its
probability of appearing after the given preceding word, so that the statistical
relationship between two adjoining words is also taken into account (compare with
the ‘second-order letter approximation’, which is related to the entropy $H_2$ and is
described on p. 182). This new approximation has the form:


THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH 
WRITER THAT THE CHARACTER OF THIS POINT IS 
THEREFORE ANOTHER METHOD FOR THE 
LETTERS THAT THE TIME OF WHO 
EVER TOLD THE PROBLEM FOR 
AN UNEXPECTED 


Here also the whole text is senseless but various parts of it, composed of 
several adjoining words, compare favourably with the passages from the sensible 
English writing. 

Let us now discuss the use of word statistics for the approximate estimation
of the entropies $H_N$ of the English language. It is clear that knowing the
frequencies (probabilities) $p_1, p_2, \dots, p_K$ of individual words (here $K$ is the total
number of words encountered in the text under consideration), we can calculate
the ‘first-order word entropy’ by

\[
H_1^{(\mathrm{word})} = -p_1 \log p_1 - p_2 \log p_2 - \dots - p_K \log p_K.
\]

Dividing the obtained value of $H_1^{(\mathrm{word})}$ by the average number $w$ of letters in
an English word, we get an estimate for the conditional entropy $H_w$ of order $w$.
Indeed, it is easy to see that $H_1^{(\mathrm{word})}/w < H_w$, because the correlation
among the $w$ letters of one word is appreciably stronger than that among $w$ arbitrary
successive letters of a meaningful text. On the other hand, the ratio $H_1^{(\mathrm{word})}/w$
is certainly larger than the average information $H = H_\infty$ contained in one text
letter, for the quantity $H_1^{(\mathrm{word})}$ takes no account of the dependence
existing among words (see p. 207 et seq.).†
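
As an illustration, the first-order word entropy and its per-letter value can be computed from any word list. This is a minimal sketch, not Shannon's computation: the function name is hypothetical, word probabilities are replaced by frequencies in a toy sample, and the option of counting the space after each word as one more letter is an assumption made here for use with the 27-letter alphabet.

```python
import math
from collections import Counter

def word_entropy_per_letter(text, count_space=True):
    """Return (H_word, w, H_word / w) for the words of `text`.

    H_word is the entropy (in bits) of the word frequencies and w is the
    average word length; with count_space=True the space that follows each
    word is counted as one more letter (an assumption of this sketch).
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    h_word = -sum(c / total * math.log2(c / total) for c in counts.values())
    w = sum(len(word) for word in words) / total
    if count_space:
        w += 1
    return h_word, w, h_word / w

h_word, w, per_letter = word_entropy_per_letter("the head and the front of the line and the end")
print(round(h_word, 3), round(w, 3), round(per_letter, 3))
```

A toy sample like this one says nothing about English as a whole; frequency dictionaries are built from counts over very long texts.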

According to Pratt [149], $w = 4.5$ for the English language. This enabled
Shannon [159] to consider that $H_1^{(\mathrm{word})}/w$ can be used as an approximate
estimate of the entropy $H_5$ or $H_6$. Shannon [159] also tried to extrapolate still
further the series of values of $H_N$ obtained. By this method he obtained a rather
crude estimate of $H_8$, which also agrees well with some deductions from the
existing cryptographic data. His results are summarized in the accompanying
table (the above-mentioned values of $H_2$ and $H_3$ are also included here).


TABLE
$H_0$    $H_1$    $H_2$    $H_3$    $H_5$ or $H_6$    $H_8$
4.75     4.03     3.32     3.10     ≈ 2.1             ≈ 1.9


†Cf. also Urbach [169], in which Shannon’s method is reconsidered and some estimates of
the entropies $H_N$ other than those in [159] are derived. (In [169] the space between words is not
included in the number of letters. This fact is, however, quite simple to take into account: see
p. 203 et seq.)

Hence, it can be concluded that for the English language the redundancy $R$ is
in any case not less than 1 − (1.9/4.75) = 0.6, i.e., it certainly exceeds 60%.
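
The arithmetic behind this bound is a one-line application of $R = 1 - H_\infty/H_0$. The sketch below merely restates it; the function names are hypothetical, and the value 1.9 bits is used as a stand-in for $H_\infty$ (which can only be smaller).

```python
import math

def redundancy(h_limit_bits, alphabet_size=27):
    """R = 1 - H_inf / H_0, where H_0 = log2(alphabet_size) bits."""
    return 1.0 - h_limit_bits / math.log2(alphabet_size)

def shortest_length(message_length, h_limit_bits, alphabet_size=27):
    """Average length (1 - R) * M of an ideally coded M-letter message."""
    return (1.0 - redundancy(h_limit_bits, alphabet_size)) * message_length

print(round(redundancy(1.9), 3))            # lower bound on R when H_inf <= 1.9 bits
print(round(shortest_length(1000, 1.9), 1)) # letters needed for a 1000-letter message
```

Since `redundancy` decreases as its first argument grows, any upper bound on $H_\infty$ gives a lower bound on $R$, and any lower bound on $H_\infty$ gives an upper bound on $R$.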

For a more precise estimate of the quantity $R$, it is further necessary to
determine how much the quantity $H_8$ (the average information contained in a
letter of a text given that the preceding seven letters are already known) differs
from the limit value $H_\infty$. In other words, the problem in which we are interested
is to what extent the arbitrariness in the choice of the next letter of an English
text is essentially restricted by the knowledge of that part of the preceding text
which is separated from this letter by more than seven letters (given that the
seven letters immediately preceding it are also known). Since the average word
length in the English language is close to four or five letters, i.e., it is appreciably
less than seven letters, the question here can only concern the influence of statistical
laws related to the dependence between successive words of English text (or even more
general laws related to the succession of sentences). A direct solution of the
problem we are interested in, through calculating the quantities $H_9, H_{10}, \dots$ by
using the formula given on p. 184, is impossible, since for the determination of
$H_9$ we need to know the probabilities of all nine-letter combinations, whose number
is expressed by a 13-digit number (trillions!). Hence to evaluate the quantity
$H_N$ for large values of $N$, we have to confine ourselves to indirect methods. Here
we shall briefly sketch a clever method of this sort, due to Shannon [159].

The ‘conditional entropy’ $H_N$ is a measure of the uncertainty of experiment
$\alpha_N$, consisting of finding the $N$th text letter, given that the preceding $N - 1$ letters
are known. This quantity naturally determines qualitatively the difficulty in
guessing the $N$th letter when the preceding $N - 1$ letters are known. But the
experiment of guessing the $N$th letter can be easily set up: for this it suffices to
choose an $(N - 1)$-letter fragment of a genuine English text and ask someone
to guess the next letter.† The experiment can be similarly repeated many times;
the labour involved here in finding the $N$th letter can be well estimated by means
of the average value $Q_N$ of the number of attempts entailed in the determination
of the correct answer. It is clear that the quantities $Q_N$ defined for different
values of $N$ are definite characteristics of the statistical structure of a language,
in particular of its redundancy. In fact, in the case of zero redundancy, knowledge
of an arbitrarily long fragment of text does not increase the probability
of correctly guessing the next letter (this probability in all cases is $1/n$, where $n$
is the number of alphabet letters). On the other hand, the equality of the redundancy
to the quantity $1/m$ can be described quite roughly as a statement that
every $m$th text letter is ‘superfluous’, being uniquely reconstructed from the preceding
$m - 1$ letters.

†Shannon suggests that questions be put to a number of persons, and that the person who
gives the best answers, on the average, then be selected, for it is assumed here that the guessing is
carried out in the most rational way, i.e., with a complete knowledge of the statistical structure
of the language.

Obviously, the average number of attempts $Q_N$ can only decrease as $N$ increases;
the cessation of this decrease indicates that the corresponding
experiments have the same degree of uncertainty as experiments for greater values
of $N$, i.e., that the ‘conditional entropy’ $H_N$ has practically attained the limit value
$H_\infty$. Starting from these arguments, Shannon carried out a number of such
experiments, in which $N$ took the values 1, 2, 3, ..., 14, 15 and 100. In this
connection, he observed that to find the 100th letter given the 99 preceding
ones is a considerably simpler problem than to find the 15th letter given
the 14 preceding ones. Hence it can be concluded that $H_{15}$ is substantially
larger than $H_{100}$, i.e., that it is by no means possible to identify $H_{15}$
with the limit value $H_\infty$. The same experiments were later conducted on a
somewhat larger scale by Burton and Licklider [81] for $N$ = 1, 2, 4, 8, 16, 32,
64, 128 and $N$ = 10,000. From their data it is possible to infer that the quantity
$H_{32}$ (and, of course, also $H_{64}$ and $H_{128}$) practically does not differ from $H_{10,000}$,
while the ‘conditional entropy’ $H_{16}$ is still appreciably larger than this quantity.
Thus, it can be assumed that with increasing $N$ the quantity $H_N$ decreases up to
values of $N$ of order 30, but with further growth of $N$ it remains practically invariant;
hence, instead of the ‘limiting entropy’ $H_\infty$ we can speak, for instance,
of the conditional entropy $H_{30}$ or $H_{40}$.

The experiments on guessing letters not only enable us to determine the
comparative magnitudes of the conditional entropies $H_N$ for distinct $N$, but also
provide an opportunity to estimate the values of $H_N$ themselves. This opportunity is
connected with the fact that from the data of such experiments it is possible not
only to determine the average number $Q_N$ of trials required to guess the $N$th
text letter given the $N - 1$ preceding letters, but also to estimate the probabilities
(i.e., limiting frequencies) $q_1^N, q_2^N, \dots, q_n^N$ of guessing a letter correctly at the
1st, 2nd, 3rd, ..., $n$th trial (where $n = 27$ is the number of alphabet letters, and the
superscript $N$ is an index, not an exponent).
It is obvious that $Q_N = q_1^N \cdot 1 + q_2^N \cdot 2 + \dots + q_n^N \cdot n$. It is also easy
to understand that the probabilities $q_1^1, q_2^1, \dots, q_n^1$ are the probabilities $p(a_1)$,
$p(a_2), \dots, p(a_n)$ of the alphabet letters $a_1, a_2, \dots, a_n$ arranged in order of decreasing
probabilities. In fact, if no letter preceding the letter $x$ to be guessed is
known to us, then it is natural to assume first that $x$ coincides with the most
widely used letter $a_1$ (the probability of guessing correctly here being $p(a_1)$); next
we must assume that $x$ coincides with $a_2$ (the probability of correct guessing here
being $p(a_2)$), and so on. This implies that the entropy $H_1$ equals the sum

\[
-q_1^1 \log q_1^1 - q_2^1 \log q_2^1 - \dots - q_n^1 \log q_n^1.
\]
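
The guessing experiment can be imitated by a program that plays the part of the guesser. The sketch below rests on loud assumptions: the function names are hypothetical, the ‘guesser’ knows only the letter frequencies of the very text it is tested on (so it is an optimistic stand-in for a human subject, especially for larger N), and the text is treated as cyclic so that every letter has N − 1 predecessors. For each letter it names candidates in order of decreasing frequency after the given context and records the number of the trial on which the right letter is reached.

```python
from collections import Counter, defaultdict

def guess_rank_frequencies(text, n):
    """Return [q_1, q_2, ...]: the share of letters guessed on the 1st, 2nd, ... trial
    by a guesser who knows the preceding n - 1 letters."""
    k = n - 1
    wrapped = text[len(text) - k:] + text  # cyclic: prefix the last k letters
    followers = defaultdict(Counter)       # context -> counts of the next letter
    for i in range(len(text)):
        followers[wrapped[i:i + k]][wrapped[i + k]] += 1
    alphabet = sorted(set(text))
    ranks = Counter()
    for i in range(len(text)):
        context, letter = wrapped[i:i + k], wrapped[i + k]
        # Guess the letters in order of decreasing frequency after this context.
        guesses = sorted(alphabet, key=lambda a: -followers[context][a])
        ranks[guesses.index(letter) + 1] += 1
    return [ranks[r] / len(text) for r in range(1, len(alphabet) + 1)]

def average_guesses(q):
    """Q = 1*q_1 + 2*q_2 + ... + n*q_n."""
    return sum(i * p for i, p in enumerate(q, start=1))

sample = "the quick brown fox jumps over the lazy dog and the dog sleeps "
for n in (1, 2, 3):
    print(n, round(average_guesses(guess_rank_frequencies(sample, n)), 3))
```

For `n = 1` the returned list is just the list of letter frequencies in decreasing order, in agreement with the remark above.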


If, however, $N > 1$, then it can be shown that the conditional entropy $H_N$
does not exceed the sum

\[
-q_1^N \log q_1^N - q_2^N \log q_2^N - \dots - q_n^N \log q_n^N. \tag{*}
\]

The inequality (*) follows from the fact that the quantities $q_1^N, q_2^N, \dots, q_n^N$
result from some averaging of the probabilities of the outcomes of experiment
$\alpha_N$ (see Shannon [159] or Savchuk [155]). On the other hand, somewhat deeper
(but at the same time not tedious) reasoning enables us to show that the sum

\[
(q_1^N - q_2^N) \log 1 + 2(q_2^N - q_3^N) \log 2 + \dots + (n - 1)(q_{n-1}^N - q_n^N) \log (n - 1) + n q_n^N \log n \tag{**}
\]


for every $N$ is not greater than the conditional entropy $H_N$.† Thus, the expressions
(*) and (**) (made up of the probabilities $q_1^N, q_2^N, \dots, q_n^N$, which can be
estimated from the data of guessing experiments) define the bounds between which
$H_N$ must be contained.
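
Both bounds are simple functions of the guessing frequencies. The following sketch evaluates them for a given list; the function name is hypothetical, the list is assumed to be arranged in non-increasing order, and a term $q_{n+1} = 0$ is appended so that the last summand of (**) takes the form $n q_n \log n$.

```python
import math

def shannon_bounds(q):
    """Return (lower, upper) in bits for guessing frequencies q_1 >= q_2 >= ... >= q_n.

    upper = -sum q_i log2 q_i                      (the sum (*))
    lower =  sum i (q_i - q_{i+1}) log2 i          (the sum (**), with q_{n+1} = 0)
    """
    n = len(q)
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    ext = list(q) + [0.0]
    lower = sum(i * (ext[i - 1] - ext[i]) * math.log2(i) for i in range(1, n + 1))
    return lower, upper

# Hypothetical frequencies, for illustration only (not experimental data).
print(shannon_bounds([0.75, 0.25]))
```

The two bounds coincide when all non-zero $q_i$ are equal (for example, when every letter is guessed at the first trial, or when all $n$ trials are equally likely to succeed); otherwise they bracket an interval.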

It is also necessary to keep in mind that both the estimates (*) and (**) are
obtained on the assumption that $q_1^N, q_2^N, \dots, q_n^N$ are the probabilities of
guessing a letter, given the $N - 1$ preceding letters, at the first, second, third,
... trials that prevail when the guesser always names the
next letter in the most appropriate way, with full regard to all statistical laws of the given
language (see footnote on p. 188). In the case of real experiments, however,
any mistake in the strategy of guessing (i.e., any deviation of the letter named
from the one dictated by the exact statistics of the language) inevitably
leads to an overstatement of both the sums (*) and (**). Hence, it is
expedient to take note of only the data of the ‘most successful
guesser’, since this overstatement will be the least for him. However, since every
guesser sometimes deviates from the best guessing strategy, it is practically impossible
to consider (**) as a completely reliable lower bound on the true entropy
(in distinction to the upper bound (*), which because of erroneous guesses may
only become still larger).

Furthermore, the values of (*) and (**) unfortunately do not come closer together
indefinitely with increasing $N$ (starting with $N \approx 30$ these sums in general
cease to depend on $N$); hence the estimates of redundancy for a language obtained
here are rather loose.†† In particular, Shannon’s experiments [159] show
only that $H_{100}$ is apparently contained between 0.6 and 1.3 bits. Hence, it can
be concluded that the redundancy

\[
R = 1 - \frac{H_\infty}{H_0} \approx 1 - \frac{H_{100}}{\log 27}
\]

for English is almost certainly higher than 70% and quite probably may be close
to 80% or even higher. The experiments due to Burton and Licklider [81] led
to similar results: according to their data, the true value of redundancy for
English lies somewhere between 2/3 (i.e., 67%) and 4/5 (i.e., 80%). Finally, Piotrovskii,
Bektaev and Piotrovskaya [148] indicate the following results of the
letter-guessing experiments for the English language: 72% < R < 84%. They have also
given specific results for three different types of English text; these results will be
discussed later.

†The derivation of this result due to Shannon has been further elucidated by Savchuk [155]
and Maixner [124].

††See Savchuk [155], where completely artificial ‘languages’ are constructed for which
Shannon’s entropy estimate (*), or correspondingly (**), is exact.

Shannon’s method of entropy estimation by guessing experiments was considerably
improved subsequently by Kolmogorov and by Cover and King. This interesting
development, heralding the information theory approach to the written
language, will be discussed below in this section. At this stage, we shall only
remark that the estimate of the entropy $H_\infty$ of English text due to Cover and
King [86], which is apparently the best available at present (but is also only
preliminary), shows that $H_\infty$ is smaller than 1.3 bits. It agrees well with Shannon’s
estimate and also shows that the redundancy $R$ of English is not lower
than 70%.

Before we undertake an examination of the results for various other languages, 
it is appropriate to make a few additional comments. A mention has already been 
made of Shannon’s recommendation to take note of only the results of the subject 
who guesses the letters most successfully. It is clear that the amount of success 
in guessing characterizes the degree of guesser’s understanding (usually intuitive) 
of the statistical laws of language, i.e., a ‘feel of language’ intrinsic to a given 
subject (or a ‘feel of style’ of the given author whose text is used for letter-guessing;
cf. the remark in Kolmogorov [15] that one of the guessers, who obviously
has a particularly developed literary flair, seems to have a ‘telepathic relationship
with the author’). Hence, from Shannon’s viewpoint, the differences between
the results of different subjects, participating in letter-guessing experiments, have 
to be regarded as undesirable (though, unfortunately, these are unavoidable), 
because these experiments rely on an ‘ideal guesser’ having the maximum amount 
of familiarity with the intrinsic statistics of the given language. It was, however, 
observed by Attneave [42] that in fact the differences in the entropy values ob- 
tained by different subjects in letter-guessing experiments are of definite interest, 
because such differences characterize quantitatively the level of language fluency, 
vocabulary and factual knowledge possessed by different subjects. In fact, efforts 
have already been made to utilize the results of letter-guessing experiments for 
an objective measurement of the extent of one’s grasp over a foreign language 
([161]; see also [112]) or one’s mother tongue (see, e.g., [135], which describes experiments
on letter guessing of a highly specialized text by a few groups of persons
with highly diverse levels of practice in reading texts of similar content).
Weltner [173] adduces quite rich material related to similar evaluations of the ‘subjective
information’ contained in a given text (for a given person), and emphasises
the great value of such subjective information for educational purposes.

Weltner started from a slightly modified version of Shannon’s letter-guessing
method† and used it for the determination of the subjective information contained in
diverse types of texts (a scrambled text, scientific paper, poem, fiction prose text, 
newspaper text, ordinary and programmed textbooks) for different categories of 
readers (high school students from different schools, students of a teacher train- 
ing college etc.). The letter-guessing experiments due to Nemetz and Simon [134], 
which will be described in detail later, were also carried out on a collection of 
different guessers (e.g. specialists in mathematical statistics, teen-aged high school 
students, teachers of literature and mathematics in high school, and so on). In 
the present book, however, we shall not dwell upon the study of subjective in- 
formation, which is more predominantly a psychological than a purely math- 
ematical notion. 


Now let us consider the results related to various foreign
languages. It is clear that for all languages that make use of the English alphabet,
the maximum information $H_0$ that can be conveyed by one letter of a text
(including the space) has one and the same value:††

\[
H_0 = \log 27 = 4.75 \text{ bits}.
\]

However, the frequencies of the appearance of various letters and multi-letter
combinations are obviously different in different languages. Thus, for
example, by arranging all letters in order of decreasing probabilities (starting
with the most frequent of them), we arrive at a sequence of letters beginning
with —ETAONRI... in the case of the English language, where ‘—’ denotes the
space between words (see p. 179 above). In the case of the German
language the corresponding sequence will begin with —ENISTRAD..., and in
the case of French with —ESANITUR... (see [75]).

†The main difference between Shannon’s and Weltner’s experiments is related to the fact
that Weltner considers a 32-letter alphabet (including also some punctuation marks) and reduces
the guessing process to ‘binary choices’.

††There are, however, languages which use ‘English-like’ (or, more correctly, ‘Latin-like’)
alphabets but with a different number of letters. Several European languages do not use all the
English letters (e.g., the two letters K and W are not used in Spanish, and the five letters J, K, W, X and
Y are not used in Italian). On the other hand, there are European alphabets which include some
supplementary letters that differ from ordinary English letters by special marks (e.g., the letters ä
and ö are used in Finnish, Swedish and German, à, é and ç in French, ñ in Spanish, ø in
Norwegian, å in Swedish and Norwegian, and so on). We may, however, agree to include in the
Spanish, Italian and related alphabets all the missing English letters as letters having zero
probability (in fact, these letters may occur occasionally in the corresponding texts when foreign
names are mentioned). We may also agree to make no distinction between supplementary
letters and related English letters (and in the case of German to write ä, ö and ü as ae, oe and
ue). With this approach our statement regarding $H_0$ remains correct for all languages that
use the Latin alphabet.

The average word-length, which determines the probability of the ‘space’, is appreciably
greater in the German language than in English or French; the letters W
and K are encountered comparatively frequently in German and English,
but have very low probability in French; the combination TH is
quite prevalent in the English language and so is SCH in the German language,
but in other languages both these combinations are considerably less frequent;
the letter C is almost always followed either by the letter H or by K in the German
language, but not in English or French, and so on. Therefore, the first-order,
second-order, ..., letter approximations, whose typical examples for English
have been set out on pp. 181, 182 and 183, will have quite different forms for
different languages (though, of course, the zero-order approximation remains
the same for all languages). Abramson [1] has presented the approximations
of the first three orders for French, German and Spanish; the corresponding
third-order approximations (related to the entropy $H_3$) are given by the following
examples:


JOU MOUPLAS DE MONNERNAISSAINS DEME US VREH BRE TU 
DE TOUCHEUR DIMMERE LLES MAR ELAME RE A VER IL 
DOUVENTS SO (in the French) 


BET EREINER SOMMEIT SINACH GAN TURHATTER AUM WIE 
BEST ALLIENDER TAUSSICHELLE LAUFURCHT ER 
BLEINDESEIT UBER KONN (in the German) 


RAMA DE LLA EL GUIA IMO SUS CONDIAS SU E UNCONDADADO
DEA MARE TO BUERBALIA NUE
Y HERARSIN DE SE SUS SUPAROCEDA (in the Spanish)


The three passages are senseless in any language, but nevertheless anyone who
has even a rudimentary knowledge of the indicated languages can easily determine
which language each of these passages approximates.

Using the letter frequency tables for different languages it is possible to
compute the corresponding values (in bits) of the entropy $H_1$. Some of the
results are listed in the accompanying table.


Language    English    German    French    Spanish    Italian    Portuguese
$H_1$       4.03       4.10      3.96      3.98       3.90       3.91


(see Barnard [73] and Manfrino [128]).† In all the cases the value of H_1 is seen
to be appreciably less than H_0 = log 27 = 4.75 bits, and the values of H_1 for


†The values of H_1 for the Italian and Portuguese languages were computed by Manfrino
for three different types of texts and for the alphabet which did not include the space. For
our purpose, we have taken the mean of Manfrino's three values of H_1 and then recalculated
the values of H_1 for the alphabet which includes space. The equation that enables us
to make such a recalculation shall be given below in this section (see p. 204).
different languages do not strongly differ here from each other. Of the examples
cited, H_1 has the highest and least values, respectively, for the German and
Italian languages. The higher value for German is apparently due to the
fact that the average word-length has the largest value in this language and
hence the probability of space (which is the most frequent letter in all languages)
is smaller in German than in all the other languages considered. As regards the
Italian language, there are five different Latin letters which have a zero
probability in Italian; hence, the number of outcomes of the letter-guessing
experiment α_1 is here smaller than in the other languages. The values of H_1 for
some other languages not included in the above table can be found in [128], [113]
and other references listed at the end of the book.
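As a minimal sketch of how H_1 is obtained from a letter frequency table, the following Python fragment estimates the frequencies from a sample and applies the formula H_1 = −Σ p_i log p_i (logarithms to base 2). The function name `first_order_entropy` and the test strings are illustrative assumptions, not data from the cited studies.

```python
import math
import string
from collections import Counter

def first_order_entropy(text):
    """H_1 in bits per letter, from the relative letter frequencies of `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# For a text in which all 27 'letters' (26 letters and the space) are
# equally frequent, H_1 coincides with H_0 = log 27.
uniform_text = string.ascii_uppercase + " "
print(first_order_entropy(uniform_text), math.log2(27))
```

Any departure from equal frequencies lowers the value below log 27, which is the effect seen in the table above.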

To compute the entropies H_2 and H_3 for various languages it is necessary to
know the corresponding digram and trigram probabilities, i.e., the probabilities
of two-letter and three-letter combinations. Some data on these probabilities
for a number of languages may be found in Pratt [149]. Further data have been
published recently by several authors engaged in the specific task of calculating
the entropies H_2 and H_3. For example, Petrova [143] utilized a sample of
30,000 characters made up from miscellaneous French texts for the evaluation
of the French digram and trigram probabilities. With a similar objective,
Manfrino [128] used three 10,000-letter samples from Italian scientific,
history and newspaper texts as well as three samples of about the same length from
Portuguese (Brazilian) scientific, fiction and newspaper texts; however, in
contrast to Petrova, he did not include the space in the number of alphabet
letters. Lebedev and Garmash [120] treated a passage from L. N. Tolstoy's novel
War and Peace, containing roughly 30,000 letters; they counted the space among
the letters and considered the Russian alphabet as consisting of 32 letters
(they made no distinction between the letters е and ё, or between ь and ъ, which
is also the practice followed in almost all Russian telegraph codes). Wanas et al. [128]
analyzed a sample of 64,000 letters from one of the Arabic newspapers
(the Arabic alphabet selected for this study consisted of 32 letters). There are
also more examples of related studies which we shall not mention here; the
reader is referred to the references at the end of the book. The results of the
investigations cited are listed in the following table, which contains the values
of H_0, H_1, H_2 and H_3 together with, for the sake of completeness, the entropy
values for the English language:


Language   English   French   Italian   Portuguese   Russian   Arabic
H_0        4.75      4.75     4.39      4.52         5.00      5.00
H_1        4.03      3.95     3.90      3.91         4.35      4.21
H_2        3.32      3.17     3.32      3.35         3.52      3.77
H_3        3.10      2.83     2.76      3.20         3.01      2.49
The different columns of the table do not differ sharply from each other, which
does not seem to be surprising. However, the table can hardly be used for
evaluating the redundancy of the written text in the indicated languages, since
H_3 is obviously still quite far from the limiting value H_∞.
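The step from digram and trigram probabilities to H_2 and H_3 can be sketched as follows. The sketch relies on the standard identity that the conditional entropy of the Nth letter, given the N − 1 preceding ones, equals the entropy of N-letter blocks minus the entropy of (N − 1)-letter blocks. The function names and the toy text are assumptions made for illustration.

```python
import math
from collections import Counter

def block_entropy(text, n):
    """Entropy (in bits) of the n-letter combinations occurring in `text`."""
    if n == 0:
        return 0.0
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

def conditional_entropy(text, n):
    """Estimate of H_n: block entropy of order n minus that of order n - 1."""
    return block_entropy(text, n) - block_entropy(text, n - 1)

# Toy example: in a strictly alternating text each letter is fully determined
# by its predecessor, so H_1 is 1 bit while H_2 is practically zero.
toy = "AB" * 50
print(conditional_entropy(toy, 1), conditional_entropy(toy, 2))
```

On samples of limited length the number of distinct trigrams actually observed is far below the number possible, so estimates of this kind tend to come out too low as N grows; this is one reason why samples of tens of thousands of letters were used in the studies cited.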

An approximate estimate of H_N with N > 3 can be obtained with the aid of
Shannon's method (based on the letter-guessing experiments), or some modified
variant of it. A number of attempts undertaken with this aim are described
in the existing literature. However, the fact remains that most of the
results achieved in this direction are of an even more preliminary character than
the rough results for the English language presented above.

The redundancy for German has been investigated quite thoroughly by
Küpfmüller [118]. By using the available data on the frequencies of occurrence of
different syllables and words in German and performing some experiments on
guessing the succeeding syllables or words of a German text with respect to the
known preceding excerpt, Küpfmüller inferred that for the German language
H_∞ ≈ 1.3 bits. This implies that the redundancy R of this language is close to


$$R = 1 - \frac{1.3}{4.75} \approx 0.73,$$


a value having the same order of magnitude as the estimate of the redundancy for 
English deduced above. The value of H, for German may be found, in particular, 
in [93]. The results of letter-guessing experiments for the German language are
presented in Piotrovskii, Bektaev and Piotrovskaya [148]. These include two
estimates of the redundancy R deduced from Shannon's upper and lower entropy
limits (*) and (**) (see pp. 189 and 190) related to three different types of
German text (conversational language, fiction and various business texts). The
average results of Piotrovskii et al. for the German language at large are very
close to the corresponding results for the English language: they indicate that
MNYZSRK 85%. 

A study of the entropy and redundancy of the French language has been made
in great depth by Petrova [143]. Her results related to the values of the entropies
H_N for N = 1, 2 and 3 have been briefly described above. To determine the
values of H_N when N is large, letter-guessing experiments were employed,
applying partly a refinement of the procedure suggested by Kolmogorov, of which
we shall say more later on. The deductions in [143] have yielded the estimate
H_∞ ≈ 1.40 bits and, consequently, R ≈ 71%. A similar (but somewhat cruder)
study of the redundancy for the Swedish language has been carried out by
Hansson [128], leading to the result that H_∞ < 2 bits and R > 1 − (2/log 30) ≈ 59%
(Hansson considered the 29-letter Swedish alphabet, i.e., the total number of
letters accounted for by him, with the inclusion of word space, came to 30). For
several other evaluations of the entropies and redundancies of various languages
the reader is referred to [113], [128], [147] and [148]. In [148], in particular,
estimates of the redundancy R from above and below are adduced for three
diverse types of texts (colloquial, fictional and scientific) written in seven
languages (English, German, Russian, French, Polish, Rumanian and Kazakh). These
estimates have been obtained with the aid of letter-guessing experiments on the
basis of relations (*) and (**) and deviate only slightly from each other
irrespective of the seven languages involved.
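The redundancy figures quoted above all follow from the relation R = 1 − H_∞/H_0 with H_0 = log n. A brief numerical check may be sketched as below; the function name `redundancy` is illustrative, and the alphabet size n = 27 used for German and French is an assumption here (the Latin alphabet plus the space), whereas n = 30 for Swedish is stated in the text.

```python
import math

def redundancy(h_inf, alphabet_size):
    """R = 1 - H_inf / H_0, where H_0 = log2(alphabet_size)."""
    return 1.0 - h_inf / math.log2(alphabet_size)

print(round(redundancy(1.3, 27), 2))    # German, H_inf of about 1.3 bits
print(round(redundancy(1.40, 27), 2))   # French, H_inf of about 1.40 bits
print(round(redundancy(2.0, 30), 2))    # Swedish, lower bound for H_inf < 2 bits
```

Since H_0 grows only logarithmically with the alphabet size, moderate differences in the number of letters change R far less than differences in H_∞ do.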

The results indicated show that the redundancy estimates of most of the
European languages do not diverge widely from each other, but this fact does not
permit us to conclude that the same must hold also for languages which are
either quite far apart in their linguistic structure or differ sharply in their alphabets.
In this connection, the investigation of Newman and Waugh [138] is of interest.
They endeavoured to compare the entropies H_N and redundancies R for
three languages with appreciably distinct numbers of alphabet letters: the
Polynesian Samoan language, whose alphabet contains altogether 16 letters
(nearly 60% of which are vowels), English, and Russian. In the case of
Russian, the specially chosen texts were printed in the old orthography (used in
Russia up to 1918), using a 35-letter alphabet. It is natural for the quantity H_0
to have highly different values for these three languages (see the accompanying
table). The values of H_1 listed in this table for the three languages differ still
more sharply.

        Samoan            English           Russian (old orthography)
H_0     log 17 = 4.09     log 27 = 4.75     log 36 = 5.17
H_1     3.40              4.08              4.55
H_2     2.68              3.23              3.44

(The letter frequencies used in the evaluation of H_1 were compiled by Newman
and Waugh on the basis of an analysis of the same passage, nearly 10,000
characters long, from three translations of the Bible.) The variations in the
values of H_1 roughly signify that the probability distribution of individual
letters is most uniform in Russian and most nonuniform in Samoan. To a
considerable extent this conclusion is explained by the fact that in Samoan the
average word-length is quite small: it is close to 3.2 letters against 4.1 letters
for the English and 5.3 letters for the Russian texts considered. Hence the word
space, the most frequent character, has the largest probability in Samoan, less
in English, and still less in Russian. However, the values of H_2 for the three
languages are found to be closer than those of H_1: the two-letter correlations
in Russian are more stringent than in English and still more so than in Samoan.

Unfortunately, the successive values of H_N for N > 2 given by Newman and
Waugh are not reliable, for they were obtained by means of a disputable method
developed by Newman and Gerstman [137]. However, their conclusions
concerning the comparative values of H_N for the three languages appear
plausible. According to these conclusions, the values of H_N decrease
most rapidly in Russian and most slowly in Samoan; as a result, starting from
approximately N = 10 the values of H_N (and, consequently, also of H_∞) for the
three languages are found to be sufficiently close to each other. This signifies
that the average amount of information per text letter for three languages having
appreciably distinct numbers of alphabet letters is approximately the same. If
this conclusion is true, then it obviously implies that the redundancy is
considerably greater for languages rich in distinct letters than for
those with meagre alphabets.

Note also that in all European languages the vowels are considerably more
frequent than the consonants. This fact is responsible for significant differences
in the frequencies of individual letters, which appreciably affect the value of
the 'first-order entropy' H_1 (and also the 'limit entropy' H = H_∞ and the
redundancy R) of a language. The position is different in a number of Oriental
languages. For instance, in Hebrew the vowels are not used at all: they are
omitted in the written text and are supplied by the reader 'according to sense'
(this is possible by virtue of the redundancy of the language). It is clear that
the statistical structure of a text written in this language differs sharply from
that encountered in European languages, in view of which all the
information-theoretic characteristics of the language may take here quite different
values (in particular, the redundancy must be appreciably lower). As an
illustration of this remark, a reference may be made to Bluhme [78], who compared
statistical characteristics of a collection of three-letter words from Hebrew and
English and discovered that for this collection


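$$H^{(\mathrm{Heb})} \approx 3.73 \ \text{(bit/letter)} \quad\text{and}\quad R^{(\mathrm{Heb})} = 1 - \frac{3.73}{\log 22} \approx 0.16,$$

whereas

$$H^{(\mathrm{Eng})} \approx 0.83 \ \text{(bit/letter)} \quad\text{and}\quad R^{(\mathrm{Eng})} \approx 0.82.$$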


The entropy of individual Indian languages was also studied in detail in the
sixties, in the first place the Dravidian languages prevalent in South India and
belonging to the stock of the most ancient human languages. In [160], starting
from the statistical language data (and taking note of the correction introduced
in [74]), the values of lower order entropies are found for several Indian
languages and Shannon's 'method of guessing experiments' is also used to estimate
the values of H_N when N is comparatively large. In this connection, we note that
in comparison to works related to European languages, new difficulties arose
because of some uncertainty about the alphabets in a majority of the languages
considered. Thus, for example, in Tamil there exist both classic and modern
alphabets; in the modern alphabet (close to the alphabets of a number of other
Indian languages) there exist 12 vowels, 18 consonants, 216 unified
consonant-vowels and one more unpronounced symbol (Aitham) for a special purpose.
In Siromoney's work [160] Aitham is completely ignored, and 'consonant-vowels' are
considered as pairs of letters; however, such an approach to the Tamil language
is not the only one possible. We shall set forth later (see p. 214) some of the
results of studies devoted to Indian languages.


Finally, let us note that the differences in the presently available estimates of
the values of the entropy H = H_∞ (or the quantities H_N, where N is moderately
large) obtained for different European languages by means of the 'guessing
experiment method' are, as a rule, appreciably smaller than the accuracy of the
respective estimates determined by the difference between the lower and upper
bound expressions (*) and (**) for the Nth order entropy.

Thus, the Shannon method turns out to be clearly inadequate for determining
differences in the specific entropy (per letter) for different languages, although
the existence of differences in the average word-length and the length of parallel
texts, having the same content, for different languages (see Ramakrishna and
Subramanian [151], and also the last reference in [160])† creates an impression
that these differences in specific entropies may be of the order of 10-20%. The
same can also be said of the differences in the entropies of texts of different
characters (in particular, due to different authors) written in the same language:
it is quite obvious that these differences may be sufficiently large, but they
can be detected by means of the Shannon method only in the most exceptional
cases (like those to which are related the works of Frick and Sumby or
Fritz and Grier, mentioned on p. 212).

In this connection, it is highly desirable to have a more precise method for
determining the entropy of a language. Kolmogorov stated that such a method
is comparatively simple to obtain by further sharpening the 'guessing method'.
In particular, Kolmogorov noted that in principle the guessing method (with
the assumption that the guessing subject always follows an 'optimal strategy',
which stems from a complete knowledge of all statistical regularities inherent
in a given language) enables us to obtain not only estimates of the upper
and lower bounds of the entropy, but also an exact estimate of the value of this
quantity. In fact, assume that while guessing one does not merely name a single
alphabet letter each time, selected in accordance with the order in which the
probabilities of letter appearance decrease, but directly indicates all
conditional probabilities p_1^N, p_2^N, ..., p_n^N of the occurrence of the 1st,
2nd, ..., nth alphabet letter (given that the N − 1 text letters preceding it are known).


†However, the two indicated works are in fact of interest only from the viewpoint of the
formulation of problems, but not from the viewpoint of the specific results obtained.
The reason is that for evaluating the 'efficiency' of different languages, only a comparison
of the 'first-order entropies' H_1 of these languages has been employed, without taking any
account of the statistical relationship between successive text letters, which is extremely
important in linguistic structure.
Suppose now that this experiment is repeated many times and each time the
value of the quantity −log p_k^N is calculated, where k is the number of the letter
that actually appeared. Thus, in every individual experiment of 'guessing', of the
n given numbers p_1^N, ..., p_n^N (where n is the number of alphabet letters), in fact
only one number is taken into account, but precisely the one which is not known
beforehand. It is now easy to show that if the conditional probabilities are
always determined exactly, then the average value of the quantities
−log p_k^N (i.e., the sum of all such quantities determined in a large
number M of experiments, divided by M) tends, for unboundedly increasing M,
to the true entropy H_N of one text letter.

This method seems to be completely impracticable: it is inconceivable to
demand of the guessing subject that every time he should determine the entire
collection of conditional probabilities of all possible letters and, in addition,
that none of them should be in error (cf., in this connection, the analysis of the
work by Cover and King [86], given below). It is, however, essential that any error
in the specified values of the conditional probabilities causes only an increase in
the corresponding sum of the values −log p_k^N (this statement, as it is easy to
show, follows from the fact that (*) on p. 189 gives an upper bound of H_N). Hence
it is completely permissible to restrict beforehand the set of probability
distributions which can be named in guessing, and thereby substantially facilitate
its performance; the sum of the values −log p_k^N thus obtained, divided by the
number M of experiments, is all the same an upper bound on the true entropy H_N.
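Both statements can be illustrated by a small simulation. The sketch below is deliberately simplified: it assumes a hypothetical four-letter source without memory, so the true conditional probabilities are the same at every step, and the names `entropy` and `average_log_loss` are chosen here for illustration. When the stated probabilities coincide with the true ones, the average of −log p_k approaches the entropy; when they do not, the average comes out larger.

```python
import math
import random

def entropy(probs):
    """Entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def average_log_loss(true_p, stated_p, trials, seed=0):
    """Mean of -log2 stated_p[k], where letter k is drawn according to true_p."""
    rng = random.Random(seed)
    letters = range(len(true_p))
    total = 0.0
    for _ in range(trials):
        k = rng.choices(letters, weights=true_p)[0]   # the letter that appeared
        total += -math.log2(stated_p[k])
    return total / trials

true_p = [0.5, 0.25, 0.125, 0.125]        # hypothetical source, entropy 1.75 bits
exact = average_log_loss(true_p, true_p, 20000)
crude = average_log_loss(true_p, [0.25] * 4, 20000)
print(entropy(true_p), exact, crude)
```

The crude guesser, who always names the uniform distribution, obtains exactly 2 bits per letter, above the true value of 1.75 bits, while the exact guesser fluctuates closely around 1.75 bits.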

In real experiments conducted under the guidance of Kolmogorov on Russian
literary texts, the guesser was allowed the following kinds of forecast (cf. [154]):


(i) one specific (say, kth) alphabet letter would certainly be next letter; 

(ii) one of the two or three alphabet letters to be indicated in guessing would 
certainly be the next letter; 

(iii) one specific (say, kth) alphabet letter would probably (but not certainly) 
be the next letter; 

(iv) one of the two or three letters to be indicated in guessing would probably 
be the next letter; 

(v) moreover, the guesser would be permitted to say that he does not know
which letter will be next.


It was also assumed that each of these statements is equivalent to the choice 
of the following conditional probability distribution for the succeeding text 
letter: 


(i) the kth letter has some preassigned large probability P; however, for
the ith letter, where i ≠ k, the probability of its appearance is taken as
p_i' = p_i (1 − P)/(1 − p_k), where p_i and p_k are the unconditional
probabilities of the ith and kth alphabet letters (these probabilities are known for
many languages, including the Russian language; for English these are
listed in the table on p. 179);

(ii) the two or three letters chosen have the same conditional probability
P/2 or P/3. The remaining letters have, as before, probabilities p_i'
proportional to their unconditional probabilities p_i;

(iii) the kth letter has some fixed probability Q (smaller than P), but the
ith letter, for i ≠ k, has the probability p_i' = p_i (1 − Q)/(1 − p_k);

(iv) the two or three letters chosen have the same probability Q/2 or Q/3,
but the remaining letters have probabilities proportional to their
unconditional probabilities;

(v) the conditional probability of the appearance of the ith alphabet letter for
all i is taken to be equal to its unconditional probability p_i.


The probabilities P and Q remain so far undetermined; however, since any
inaccuracy in the indicated conditional probability distribution may only increase
the estimate obtained for H_N, it is completely admissible to select these two
probabilities, according to the known experimental results, in such a way that
the sum of all quantities −log p_k^N (where p_k^N is the predicted conditional
probability of the letter having actually appeared) is the least possible.
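The model distributions (i) to (v) can all be produced by one small routine, sketched below under stated assumptions: the three-letter table `uncond` is hypothetical, the function name `forecast_distribution` is illustrative, and the rule used for the remaining letters (the mass 1 − P or 1 − Q shared in proportion to the unconditional probabilities) is the normalized reading of the definitions above.

```python
def forecast_distribution(uncond, chosen, mass):
    """Conditional distribution implied by one forecast.

    uncond : dict mapping each letter to its unconditional probability
    chosen : the one, two or three letters named by the guesser
             (an empty list for a forecast of type (v))
    mass   : P for types (i)-(ii), Q for types (iii)-(iv), 0 for type (v)
    """
    rest = 1.0 - sum(uncond[c] for c in chosen)   # unconditional mass of the others
    dist = {}
    for letter, p in uncond.items():
        if letter in chosen:
            dist[letter] = mass / len(chosen)     # equal shares of P or Q
        else:
            dist[letter] = p * (1.0 - mass) / rest
    return dist

uncond = {"A": 0.5, "B": 0.3, "C": 0.2}            # hypothetical three-letter alphabet
print(forecast_distribution(uncond, ["A"], 0.9))        # type (i) with P = 0.9
print(forecast_distribution(uncond, ["A", "B"], 0.6))   # type (iv) with Q = 0.6
print(forecast_distribution(uncond, [], 0.0))           # type (v)
```

Each result sums to one, so the quantity −log of the probability assigned to the letter that actually appears is a legitimate term in the upper bound, whatever values of P and Q are finally selected.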

It is easy to calculate that, with such definitions of the probabilities P and Q,
the final estimate of the entropy H_N is given by the formula

$$H_N = \frac{1}{M}\left(M_1 h_1 + M_2 h_2 + M_3 + M_4 \log 3 + S\right),$$

where M is the total number of experiments; M_1 is the number of forecasts of
type (i) or (ii); M_2 is the number of forecasts of type (iii) or (iv); M_3 is the
number of forecasts of type (ii) or (iv) in which two possible letters are
indicated; M_4 is the number of forecasts of type (ii) or (iv) in which three
possible letters are indicated; h_1 = −q_1 log q_1 − (1 − q_1) log (1 − q_1), where
q_1 = m_1/M_1 and m_1 is the number of errors in the forecasts of types (i) and (ii);
h_2 = −q_2 log q_2 − (1 − q_2) log (1 − q_2), where q_2 = m_2/M_2 is the average
fraction of errors in the forecasts of types (iii) and (iv); finally, S is the sum
(extended over all cases of errors in the forecasts of types (i), (ii), (iii) and (iv),
and all 'rejections', i.e., forecasts of type (v)) of the expressions −log p_i'', where
p_i'' is either the unconditional probability p_i of the letter having actually appeared
(in the case of forecasts of type (v)), or the forecasted probability p_i' divided
either by 1 − P (in the case of forecasts of types (i) and (ii)), or by 1 − Q (in
the case of forecasts of types (iii) and (iv)).
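To show how little computation the formula requires, here is a direct transcription into Python. The function names and, above all, the counts in the example are hypothetical and serve only to show the order of magnitude of the terms; they are not data from the experiments described.

```python
import math

def h(q):
    """Binary entropy function h(q) in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def kolmogorov_estimate(M, M1, m1, M2, m2, M3, M4, S):
    """Upper estimate of H_N from the tallies of a guessing experiment.

    M      : total number of forecasts
    M1, m1 : number of forecasts of types (i)-(ii) and errors among them
    M2, m2 : number of forecasts of types (iii)-(iv) and errors among them
    M3, M4 : forecasts naming two letters and three letters, respectively
    S      : sum of -log2 terms over all errors and all rejections
    """
    h1 = h(m1 / M1) if M1 else 0.0
    h2 = h(m2 / M2) if M2 else 0.0
    return (M1 * h1 + M2 * h2 + M3 + M4 * math.log2(3) + S) / M

# Hypothetical tallies: 100 forecasts, 10 of them rejections of type (v).
estimate = kolmogorov_estimate(M=100, M1=60, m1=6, M2=30, m2=9, M3=10, M4=5, S=40.0)
print(round(estimate, 3))
```

The terms M_1 h_1 and M_2 h_2 arise because the sum of −log terms is least when P and Q are set equal to the observed fractions of correct forecasts, 1 − q_1 and 1 − q_2.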

The above equation appears, at first glance, comparatively intricate, but
in practice it is found to be sufficiently convenient and does not involve very
cumbersome calculations. Guessing experiments of this kind were carried out
in the statistical laboratory of Moscow State University, which enabled the
experimenters to obtain, in the case of the classical nineteenth century Russian
prose of S. T. Aksakov (The Childhood of Bagrov the Grandson, a novel) and I. A.
Goncharov (Literary Party, a short story), a bound on the specific entropy H_∞
(not differing, say, from H;)) of the order of 1 to 1.2 bits. This bound is
apparently quite precise (probably exceeding the true value of H_∞ by not more than
10-15%). According to this, the value of redundancy for the literary language
of Russian classical prose is close to 80%.

The recent work of Cover and King [86] (containing an extensive bibliography)
is quite similar to Kolmogorov's investigation. They also used a refined variant
of Shannon's letter-guessing experimental technique. The main idea underlying
their work is, in fact, identical to Kolmogorov's proposition (see pp. 198-99) that
if the guesser is asked to list all the conditional probabilities p_1^N, p_2^N, ..., p_n^N of
the occurrence of the various alphabet letters at the Nth place after N − 1 known
letters, then, in the case of ideal error-free guessing, the average value of
−log p_k^N, where k is the number of the letter that has actually appeared, will give
an exact estimate of the entropy H_N. (This result was obtained in a slightly
different form by Cover and King independently of Kolmogorov.)

The procedure proposed by Cover and King has the form of the following
'gambling scheme'. Let us consider a subject having at the beginning the
capital S_0 = 1 dollar. The subject knows N − 1 letters of the text and wants to
place bets on the next letter. He is allowed to wager any portion p_i S of his
capital S on the ith alphabet letter (where, of course, p_1 + ... + p_n = 1 and
n = 27 is the number of different English letters including space). If the ith
letter appears as the Nth letter of the text, the subject receives the capital
n p_i S = 27 p_i S. The process is repeated many times; let us denote by S_M the
subject's capital after M bets. If the subject always wagers the same amount S/27
on every letter, then his capital remains unchanged after all bets. If, however,
he distributes his stakes unevenly using the known statistics of the
language, then his capital will increase with a very high probability. Cover and
King showed that the optimal gambling strategy is to wager every time a
fraction of the current capital in proportion to the conditional probability of
the next symbol, i.e., to select p_1, ..., p_n equal to the conditional probabilities
of the alphabet letters when the preceding N − 1 letters are known. If the stakes are
selected by following this strategy, then the capital S_M will increase with
probability 1 and the quantity


$$\log n \left(1 - \frac{1}{M}\log_n S_M\right) = \log_2 27 \left(1 - \frac{1}{M}\log_{27} S_M\right)$$


will tend to H_N as M → ∞. It is clear that S_M = (27)^M p_{k_1} p_{k_2} ... p_{k_M}, where
k_i is the number of the letter which actually appeared in the ith bet, and

$$\log n \left(1 - \frac{1}{M}\log_n S_M\right) = -\frac{1}{M}\left(\log p_{k_1} + \log p_{k_2} + \ldots + \log p_{k_M}\right);$$
therefore, this expression due to Cover and King is evidently equivalent to
Kolmogorov's proposition formulated on p. 199. If, however, the p_i differ from the
true conditional probabilities of the alphabet letters (which are not known exactly
to any real subject), then the increase in the capital will be slower and hence
the limit of the quantity indicated, as M → ∞, will give an estimate of H_N
from above.
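The gambling scheme is easily simulated. The sketch below again assumes, purely for illustration, a hypothetical four-letter source without memory in place of real text, and the function name `gamble_entropy_estimate` is an assumption. The logarithm of the capital is accumulated instead of the capital itself, so that long runs cannot overflow; the returned value is log n (1 − (1/M) log_n S_M) rewritten as log_2 n − (1/M) log_2 S_M.

```python
import math
import random

def gamble_entropy_estimate(true_p, bet_p, bets, seed=0):
    """Entropy estimate (bits per letter) from the final capital of a gambler.

    true_p : probabilities with which the letters actually appear
    bet_p  : fractions of the current capital staked on each letter
    bets   : number M of bets
    """
    rng = random.Random(seed)
    n = len(true_p)
    log_capital = 0.0                      # log2 of S_M, starting from S_0 = 1
    for _ in range(bets):
        k = rng.choices(range(n), weights=true_p)[0]
        log_capital += math.log2(n * bet_p[k])   # capital is multiplied by n * p_k
    return math.log2(n) - log_capital / bets

true_p = [0.5, 0.25, 0.125, 0.125]         # hypothetical source, entropy 1.75 bits
print(gamble_entropy_estimate(true_p, true_p, 20000))      # proportional betting
print(gamble_entropy_estimate(true_p, [0.25] * 4, 20000))  # equal stakes
```

With equal stakes the capital never changes and the estimate equals log n = 2 bits; with proportional betting the capital grows and the estimate settles near the true entropy of 1.75 bits.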

Cover and King performed an actual experiment on evaluating the entropy of the
English language by the procedure described. The text used was taken from the
same book, Jefferson the Virginian by Dumas Malone, which was employed in
Shannon's letter-guessing experiments [159]. Twelve persons were selected and
all of them were given the same part of the book from its beginning up to
an abrupt end in the middle of a word. They were allowed to read as much of
the book as they desired up to the selected end in order to familiarize themselves
with the author's style of writing. Each person was also allowed to use tables
of English letter, digram and trigram probabilities (Cover and King noted,
however, that the use of tables did not help to improve the results). Under
these conditions, all 12 'gamblers' were asked to distribute stakes p_1 S, p_2 S, ...,
p_27 S on the possible appearances of the next letter (here p_1 + p_2 + ... + p_27 = 1
and S is the current capital of a gambler). After every bet the actual next letter
of the text was exposed, the capital of every gambler was recomputed, and the whole
procedure was repeated. The game was finished after 75 bets (work at a
computer terminal took about 5 hours for each of the 12 subjects). Since the
number N − 1 of the text letters known beforehand was quite large in this
experiment, the estimate obtained here refers directly to H_∞. The details of the
'gamblers' decisions' were not set forth by Cover and King. However, it is clear
that the subjects involved in the experiments hardly proposed 27 different numbers
p_1, p_2, ..., p_27 at every bet, but apparently selected one probability distribution
from a small set of simple model distributions similar to (i)-(v) explicitly
formulated by Kolmogorov.

The results of all gamblers (the 'final capital' S_75 and the resultant entropy
estimate) are listed at the end of Cover and King's paper. All estimates range
between 1.3 and 1.9 bits per letter. Moreover, the best subject estimate of H_∞,
the average capital estimate (based on the total capital of the gamblers) and the
so-called committee gambling estimate (based on a more complicated averaging of
the results of different gamblers) all lead to the value H_∞ ≈ 1.3 bit/letter (i.e.,
lead to the inference that, for written English, H_∞ is in fact smaller than 1.3
bit/letter, in other words, that R > 73%). These results agree well with the
results of Kolmogorov's similar investigations related to the Russian language.

Cover and King also attempted a similar experiment for a different type of
text, namely for a text from the book Contact: First Four Minutes by Leonard
Zunin, which happened to be of greater professional interest to the selected
'gamblers' than the Dumas Malone book. This experiment had not been completed
by the time Cover and King's paper [86] was published. However, the results of
the first two subjects yielded a slightly lower entropy estimate. At present it is
not possible to decide whether this small difference in the entropy estimate for
the two books is real or fictitious.

The procedure suggested by Cover and King was applied recently by Nemetz
and Simon [134] for estimating the entropy of the Hungarian language. The
Hungarian alphabet is different from the Latin alphabet: the former includes
9 additional 'accented' letters (á, é, í, ó, ö, ő, ú, ü and ű) but excludes the
letters q, w, x and y, which may appear only in foreign words and foreign names.
However, such foreign words and names are becoming more frequent in
modern Hungarian texts; therefore, Nemetz and Simon did not exclude these
letters from the Hungarian alphabet. On the other hand, they identified the
letters í, ó, ő, ú and ű, respectively, with i, o, ö, u and ü and hence considered
a 31-letter Hungarian alphabet (including the space between words). The authors
rewrote a random collection of articles from recent Hungarian newspapers in
conformity with the alphabet decided upon by them; then an excerpt of about 100
letters was read aloud to a group of selected subjects and they were asked to
distribute stakes on the possible appearances of the next letters. The first 10-15
attempts to forecast the next letter were carried out just for training the
gamblers; then the gamble began in earnest and ended after 50 or more bets.
In all the cases the committee gambling estimate due to Cover and King gave
the best result. Nemetz and Simon's experiment led to the conclusion that the
entropy of written Hungarian lies between 1.13 and 1.49 bit/letter (the average
estimate yielded by this experiment is H_∞ ≈ 1.25 bit/letter, i.e., R ≈ 75%).
Of course, this estimate is also a preliminary one and it obviously overestimates
H_∞ (i.e., underestimates R).


Recall that throughout the foregoing we added to the number of 'letters' an
empty space between the words (this is quite natural from the viewpoint of
telegraphy). However, it is sometimes of interest to consider also the ordinary
alphabet without making allowance for space; thus, for example, we can take
up the question of the information contained in one printed text letter. A few
examples of entropy evaluations for an ordinary ('spaceless') alphabet have been
presented above (cf. Manfrino [128]). It is clear that, if we drop space from our
consideration, then the results deduced above undergo some modifications.
Thus, for instance, it is now necessary to consider the English alphabet as a
26-letter alphabet, so that H_0 = log 26 = 4.70 bits. The frequencies of individual
letters also change their values and this leads to a modified value of H_1. The new
value of H_1 can be easily deduced from the earlier value (for an 'alphabet inclusive
of space') if the average length w of the English words is known. Indeed, if
space is considered as a zeroth alphabet letter, then its probability clearly equals
p_0 = 1/(1 + w) (on the average, one space per w + 1 'letters' of a text with
spaces). Moreover, the relative frequencies of all 'real' alphabet letters get
changed in the same proportion if space is included in the 'letters' under
consideration (because the number of all 'real letters' remains here unchanged).
Hence, if p_1', p_2', ..., p_n' are the probabilities of the 1st, 2nd, ..., nth letter of an
alphabet without a space, then the corresponding probabilities p_1, p_2, ..., p_n of the
same letters with the inclusion of 'space' in the number of 'letters' are given by
the equations p_i = (1 − p_0) p_i', i = 1, 2, ..., n. Let us now write


— Py log po — Py log py — .. . — Pn log pan = H{with space), 


and 
—D; log Pi — Ps log Po Eh. Bt P,, log P,. oh (without space) 
Then, it follows easily from the above equations that 


Hiwith space) — —p, log pp — (1 — ro) log (1 — po) + (1 — po) Hwithout space) 
= h(po) + (1 — pro) Froniout space} 


where $h(p_0)$ is the function defined on p. 49 (this equation was referred to in the
footnote on p. 193; see also the footnote on p. 206). It is known that $w \approx 4.5$,
$p_0 \approx 1/5.5 \approx 0.182$ in the case of the English language. Using this value of $p_0$ and
the value $H_1^{(\text{with space})} = 4.03$ bits given on p. 193, we obtain the new value
$H_1^{(\text{without space})} = 4.14$ bits.
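
The relation between the two first-order entropies can be checked numerically. The following is only a minimal sketch: the function names and the toy three-letter distribution are assumptions introduced for illustration, and all logarithms are taken to base 2. It first verifies the identity on the toy distribution and then inverts it with the English figures quoted above (w = 4.5 and 4.03 bits with space).

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def h(p):
    """Binary entropy function h(p) in bits."""
    return entropy([p, 1.0 - p])

def with_space_from_without(h1_without, p0):
    """H1(with space) = h(p0) + (1 - p0) * H1(without space)."""
    return h(p0) + (1.0 - p0) * h1_without

def without_space_from_with(h1_with, p0):
    """Invert the relation above for H1(without space)."""
    return (h1_with - h(p0)) / (1.0 - p0)

# Toy check (hypothetical 3-letter alphabet): adding a space with
# probability p0 rescales every letter probability by (1 - p0).
letters = [0.5, 0.3, 0.2]
p0 = 0.2
with_space = [p0] + [(1.0 - p0) * p for p in letters]
assert abs(entropy(with_space)
           - with_space_from_without(entropy(letters), p0)) < 1e-12

# English figures quoted in the text: w = 4.5, H1(with space) = 4.03 bits.
w = 4.5
p0_english = 1.0 / (w + 1.0)
h1_without_english = without_space_from_with(4.03, p0_english)
print(round(p0_english, 3), round(h1_without_english, 2))
```

With these rounded inputs the inversion lands near 4.1 bits, in the neighbourhood of the 4.14 bits quoted above. Exact agreement should not be expected, presumably because the two tabulated entropies rest on separately rounded frequency data.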

The values (in bits) of the letter entropies $H_0$, $H_1$, $H_2$, $H_3$ and also the
approximate estimates of the values of $H_5$ (or $H_6$) and $H_8$ for the 26-letter English
alphabet, obtained by Shannon [159] with the rejection of spaces between words,
are listed in the accompanying table.


    $H_0$     $H_1$     $H_2$     $H_3$     $H_5$ or $H_6$     $H_8$

    4.70      4.14      3.56      3.3       ≈ 2.6              ≈ 2.3


By comparing this table with that given on p. 187 we are convinced that an
allowance for spaces between words in the English language leads to an increase
in the entropy $H_0$ and a decrease in all succeeding entropies $H_N$. The fact that
for all languages $H_0^{(\text{with space})} > H_0^{(\text{without space})}$ is completely obvious, since
we always have $\log n > \log (n - 1)$.

Furthermore, an allowance for space increases by one the number of possible
outcomes of the letter-guessing experiment $\alpha_1$, and thus increases its degree of
uncertainty $H_1$, but simultaneously this allowance leads to the emergence of an
additional ‘letter’ with an extremely large probability in comparison to the others,
which facilitates the forecast of the outcome of the experiment $\alpha_1$ and, consequently,
decreases its degree of uncertainty $H_1$. We see that the second circumstance
turns out to be more important in the case of the English language and this leads
to the inequality $H_1^{(\text{with space})} < H_1^{(\text{without space})}$. Of course, the last result may
not be true for all existing languages. For example, w = 5.92 for German (see
Pratt [149]), i.e., the average word-length is here considerably larger than that in
English and, consequently, the space probability $p_0$ for the 27-letter alphabet is
considerably smaller in German than in English. This curtails the role of the second
circumstance indicated, and in fact the application of the relationship derived
between $H_1^{(\text{with space})}$ and $H_1^{(\text{without space})}$ to the German language leads to the
conclusion that here $H_1^{(\text{without space})}$ is slightly smaller than $H_1^{(\text{with space})} \approx 4.10$
bits (cf. p. 193). However, when N is sufficiently large (exceeds the average
word-length), the outcome of an experiment consisting of the determination of the
Nth text letter with respect to the known N - 1 preceding letters in all those
cases in which this Nth letter turns out to be a ‘space’ is practically uniquely
defined by the very structure of a language. (It is easy to understand that for
large N an error in guessing the outcome of this experiment most usually takes
place only when the Nth letter happens to be the first or the second letter of a
new word.)† This implies that an allowance for space appreciably decreases the
uncertainty of this experiment and hence if N is large, then


$$H_N^{(\text{with space})} < H_N^{(\text{without space})}$$


for every language.

It is also possible to obtain an exact relationship that connects the two values of
redundancy R, calculated with and without rejection of word spaces. In fact,
consider two identical sufficiently long texts, which differ only in that in one
of them we do not take note of spaces between words. Each text is uniquely
reproduced from the other: obviously, all word spaces can be discarded in an
ordinary text and it is usually quite easy to restore the spaces in a ‘closely’
written (without word spaces) text in a familiar language. Hence, it can be
concluded that the ‘total information’ (the product of ‘specific information’, or
‘information per text letter’ $H_\infty$, by the number of letters) contained in both texts
must be one and the same. But since the number of ‘letters’ in a text with spaces
exceeds the number of ‘letters’ in a ‘closely’ written text by (w + 1)/w times,
where w is the average word-length (because on an average one space is required
for w text letters), we have


$$H_\infty^{(\text{with space})} = H_\infty^{(\text{without space})} \cdot \frac{w}{w + 1}.$$


†This intuitively obvious statement is in good agreement with the quantitative data due to
Carson [82]. According to his estimate of the numerical values of the entropies of first,
second, third, ..., letters of a word in printed English, the entropy of a letter decreases
sharply with the increase in the number of preceding word letters.
Noting further that the probability $p_0$ of space equals 1/(w + 1) and,
consequently, $w = (1/p_0) - 1$, we can rewrite this equation as†

$$H_\infty^{(\text{with space})} = H_\infty^{(\text{without space})} \cdot \frac{(1/p_0) - 1}{1/p_0},$$

or

$$H_\infty^{(\text{with space})} = (1 - p_0)\,H_\infty^{(\text{without space})}.$$


However, if the total number of alphabet letters (including space) is n, then
$H_0^{(\text{with space})} = \log n$, $H_0^{(\text{without space})} = \log (n - 1)$, and

$$\frac{H_\infty^{(\text{with space})}}{H_0^{(\text{with space})}} = \frac{H_\infty^{(\text{without space})}}{H_0^{(\text{without space})}} \times (1 - p_0)\,\frac{\log (n - 1)}{\log n},$$

or

$$(1 - R^{(\text{with space})}) = (1 - R^{(\text{without space})}) \times (1 - p_0)\,\frac{\log (n - 1)}{\log n}.$$

This is the equation we need to connect the values of redundancy for a language 
that are obtained with and without the rejection of spaces. 
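
As a rough numerical illustration of this redundancy relation, the sketch below evaluates the conversion factor for English, using n = 27 and the space probability 1/5.5 from the text. The function name and the input redundancy of 0.70 are hypothetical, chosen only to show the direction of the effect.

```python
import math

def redundancy_with_space(r_without, p0, n):
    """(1 - R_with) = (1 - R_without) * (1 - p0) * log(n - 1) / log(n).

    n counts the alphabet letters including the space.
    """
    factor = (1.0 - p0) * math.log2(n - 1) / math.log2(n)
    return 1.0 - (1.0 - r_without) * factor

# English: n = 27 and p0 = 1/5.5; R_without = 0.70 is a hypothetical input.
p0 = 1.0 / 5.5
factor = (1.0 - p0) * math.log2(26) / math.log2(27)
r_with = redundancy_with_space(0.70, p0, 27)
print(round(factor, 3), round(r_with, 3))
```

Since the factor is smaller than 1 whenever the space probability is positive, the redundancy computed with the space always exceeds the one computed without it.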


Similar arguments may also be used for ascertaining the average amount of 


†This result can be proved in a highly straightforward manner even without reference to
the constancy of ‘total information.’ In fact, suppose that $\alpha_N$ is an experiment consisting of
guessing the Nth letter of a text with word spaces with respect to the N - 1 preceding letters.
The outcome of $\alpha_N$ can be determined in two steps: in the first place, it is verified whether or
not the Nth ‘letter’ is a space (experiment $\beta$); if it is not a space, then we further ascertain
what this letter specifically is (experiment $\alpha'_N$). If $p_0$ is the probability of a space, then
obviously we are required to carry out the second experiment $\alpha'_N$ only in the $(1 - p_0)$th fraction of
all cases. Hence, it follows that

$$H(\alpha_N) = H(\beta) + (1 - p_0)\,H(\alpha'_N),$$

where $H(\alpha_N)$, $H(\alpha'_N)$ and $H(\beta)$ are the average conditional entropies of the corresponding
experiments, given that the preceding N - 1 letters are known to us (see Section 2.4). If N = 1,
then obviously $H(\beta) = -p_0 \log p_0 - (1 - p_0) \log (1 - p_0) = h(p_0)$, $H(\alpha_1) = H_1^{(\text{with space})}$,
$H(\alpha'_1) = H_1^{(\text{without space})}$, and we are back to the equation derived above for $H_1^{(\text{with space})}$.
However, for large N it can be considered that $H(\beta) = 0$ (the space is restored uniquely with
respect to the preceding N - 1 letters) and $H(\alpha_N) = H_\infty^{(\text{with space})}$, $H(\alpha'_N) = H_\infty^{(\text{without space})}$;
hence

$$H_\infty^{(\text{with space})} = (1 - p_0)\,H_\infty^{(\text{without space})}.$$
information $H^{(\text{word})}$ contained in one text word. The zero-order entropy of
one word $H_0^{(\text{word})} = \log K$ can be estimated by calculating the number of words
K in any sufficiently complete dictionary of a given language; the entropy

$$H_1^{(\text{word})} = -p_1 \log p_1 - p_2 \log p_2 - \ldots - p_K \log p_K$$

can be calculated with the aid of a ‘frequency dictionary’, indicating the
frequencies (probabilities) $p_1, p_2, \ldots, p_K$ of individual words of a given language (cf.
p. 186 above). However, a direct determination of the ‘first-order conditional
entropy’ $H_2^{(\text{word})}$ demands knowledge of the frequencies of all possible two-word
combinations, whose determination is practically impossible since the total
number of such combinations is immensely large. The problem of calculating
the succeeding ‘conditional entropies’ $H_3^{(\text{word})}, H_4^{(\text{word})}, \ldots$ is even less
tractable. Moreover, one must bear in mind that the statistical relations between
individual words are frequently appreciably more rigid than those between the
letters (the appearance of the word ‘ENTROPY’ in a text restricts the probabilities
of succeeding words more strongly than, say, the occurrence of the letter ‘G’
restricts the probabilities of succeeding letters) and that these relations are
considerably more ‘long-range’ (if the word ‘TOPOLOGICAL’ appears at the beginning of
an arbitrarily voluminous book, then this sharply decreases the probability of
the occurrence of the word ‘RHINOCEROS’ at its end). This creates the
impression that the problem of the determination of the ‘limit entropy’ (‘specific
information’) $H^{(\text{word})}$ must be exceedingly difficult.

Now let us associate two texts with each other, one written in the usual way by 
means of letters and the other ‘hieroglyphic’, in which a whole word is taken as 
a ‘letter’ (hieroglyphic writing is characterized by the fact that in it individual 
characters denote whole words). Here each of the two texts is obviously uni- 
quely reproduced from the other, since by knowing all letters of any text we 
also know thereby all words occurring in it, and a knowledge of words is equi- 
valent to knowing all the written letters. Hence here also the ‘total information’ 
contained in two texts remains the same, i.e., 


$$H^{(\text{word})} \times \text{number of text words} = H^{(\text{letter})} \times \text{number of text letters}.$$


But since the ratio of the number of letters to the number of words equals the 
average length of the word, we have 


$$H^{(\text{word})} = H^{(\text{without space})} \times w, \quad \text{or} \quad H^{(\text{word})} = H^{(\text{with space})} \times (w + 1),$$


where w is the average word-length (and hence w + 1 is the average number of
‘letters’ per word, to which is added also the space between words).
The preceding equation implies the relation

$$\frac{H^{(\text{word})}}{H_0^{(\text{word})}} = \frac{H^{(\text{letter})}}{H_0^{(\text{letter})}} \times (w + 1)\,\frac{\log n}{\log K},$$

or

$$(1 - R^{(\text{word})}) = (1 - R^{(\text{letter})}) \times (w + 1)\,\frac{\log n}{\log K},$$


where, as above, w is the average word-length, K is the total number of words
encountered in the text under consideration, n is the number of alphabet ‘letters’
to which is added also the space between words; here, as almost everywhere in
the above, by $H^{(\text{letter})}$ and $R^{(\text{letter})}$ are understood $H^{(\text{with space})}$ and $R^{(\text{with space})}$.
In particular, for the English language we have n = 27 and w + 1 = 5.5.
Putting K = 50,000 (the approximate number of words in a moderately
complete dictionary)†, we obtain


$$(1 - R^{(\text{word})}) = (1 - R^{(\text{letter})}) \times 5.5\,\frac{\log 27}{\log 50{,}000} = 1.68\,(1 - R^{(\text{letter})}).$$
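
The numerical factor in this relation is easy to recompute. The sketch below is illustrative only (the function name is an assumption); it evaluates the factor (w + 1) log n / log K for the two dictionary sizes mentioned in the text, K = 50,000 and K = 100,000.

```python
import math

def word_letter_factor(w, n, k):
    """Factor (w + 1) * log(n) / log(K) linking (1 - R_word) to (1 - R_letter)."""
    return (w + 1.0) * math.log(n) / math.log(k)

# English figures from the text: w = 4.5, n = 27 'letters' including space.
factor_50k = word_letter_factor(4.5, 27, 50_000)
factor_100k = word_letter_factor(4.5, 27, 100_000)
print(round(factor_50k, 2), round(factor_100k, 2))
```

Doubling K lowers the factor by only about 0.1, which illustrates why the result is insensitive to the exact size of the dictionary: K enters only through its logarithm, so the base of the logarithm is also immaterial here.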


It is thus seen that the redundancy for words is appreciably less than that for
letters, i.e., ‘hieroglyphic’ writing is, let us say, more ‘advantageous’ than the
customary writing by using letters. This position is closely related to the
advantage gained by directly coding long blocks of a large number of ‘letters’, of which
we have much to say in the present chapter; words are also specific ‘blocks’
(such ‘blocks’, whose probability of occurrence is comparatively high).

It is clear that similar arguments enable us also to associate the values of the
entropy (information) $H = H_\infty$ and the redundancy R assigned to one text letter
with the same quantities determined for any other linguistic formation (syllable,
phrase, morpheme etc.; cf. what is stated below on phonemes). This position
explains the reasons why an overwhelming majority of information-theoretic
investigations of a language start from its alphabet letters. In fact, a relation
between the values of entropy assigned to one letter, syllable, word etc., allows
us to confine the consideration to any one of these quantities; on the other
hand, alphabet letters have the advantages of being familiar, uniquely defined
(because for a majority of other linguistic formations like syllables, morphemes,
or even words, there exist no precise definitions, excluding fully different


†Since the number of words K appears in the preceding formula under the sign of the
logarithm, the inaccuracy in determining this number does not significantly influence the
final result (e.g., if we put K = 100,000, then the factor 1.68 in the formula that follows is
changed to 1.58).
interpretations of the same concept), and bounded in their number (since the
‘alphabets’ of words, and especially of the sentences of the language, are prac- 
tically unbounded). 

Let us now note that the relation between the values of $H^{(\text{letter})}$ and $H^{(\text{word})}$
may be used in two ways. This relation enables us to deduce an estimate
of $H^{(\text{word})}$ from the value (supposed to be known) of $H^{(\text{letter})}$. On
the other hand, the same relation also permits us to estimate the entropy $H^{(\text{letter})}$
by relying on the approximate values of $H^{(\text{word})}$ obtained by some method. The
approximate value of $H^{(\text{word})}$ (precisely speaking, the value of the first-order
entropy $H_1^{(\text{word})}$) can be calculated, say, by using the so-called Zipf principle,
which says that when words of a language are arranged in the order of their
frequencies (i.e., probabilities), the frequency of the nth most probable word for all not
too large values of n is found to be approximately proportional to 1/n. This
principle was formulated and verified through analysis of a large amount of
linguistic material by Zipf [179]; later it was repeatedly discussed and sharpened
by several authors.† The Zipf principle has been discussed at length in Chapters 5
and 12 of [17], in Part 1 of [71] and in the papers [125], [131] where, in
particular, the graphs borrowed from [179] are reproduced. These works
demonstrate the applicability of Zipf’s principle to texts written in different languages
and having different character (say, to the text of Joyce’s novel Ulysses and
that from one of the American dailies). Shannon [159] was the first to show the
usefulness of Zipf’s principle in the evaluation of the first-order word entropy
(and, proceeding from this, even in the approximate determination of the
entropy of a letter; see pp. 186-88). Similar calculations for the Italian language
were carried out by Manfrino [128]; for further relevant data in this direction,
the reader may refer to the papers of Newman and Gerstman [137], Miller [131]
and Grignetti [105].
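
The way a Zipf-type law yields a first-order word entropy can be sketched as follows. This is only an illustration under stated assumptions: the proportionality constant c = 0.1, the cutoff rule (stop at the first rank where the cumulative probability reaches 1) and the function name are all hypothetical choices, not values taken from the text. A cutoff of some kind is unavoidable, since the sum of 1/n diverges and the law can hold only for not too large n.

```python
import math

def zipf_word_entropy(c=0.1):
    """First-order word entropy (bits/word) for frequencies p_n = c / n.

    The ranks run from 1 up to the first rank at which the cumulative
    probability reaches 1; the frequencies are then renormalized so that
    they sum to exactly 1. The constant c = 0.1 is an illustrative
    assumption, not a value given in the text.
    """
    probs, total, n = [], 0.0, 0
    while total < 1.0:
        n += 1
        probs.append(c / n)
        total += c / n
    probs = [p / total for p in probs]
    entropy = -sum(p * math.log2(p) for p in probs)
    return n, entropy

vocabulary_size, h1_word = zipf_word_entropy()
bits_per_letter = h1_word / 5.5   # w + 1 = 5.5 'letters' per English word
print(vocabulary_size, round(h1_word, 2), round(bits_per_letter, 2))
```

Dividing by w + 1 gives a per-letter figure that can only overestimate the letter entropy with space, since the first-order word entropy ignores all dependence between successive words.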

An approximate estimate of the first-order entropy $H_1^{(\text{word})}$ was obtained (with
reference to the Rumanian language) by Voinescu, Fradis and Mihailescu (see
the third work of [172]) by the formula

$$H_1^{(\text{word})} = -p_1 \log p_1 - p_2 \log p_2 - \ldots - p_K \log p_K.$$

In fact, however, this work is devoted to the entropy not of written but of spoken
language (the frequencies $p_1, p_2, \ldots, p_K$ are determined here from an analysis
of tape recordings of answers given to a long series of standard questions
by ten different subjects); hence it is more appropriately dealt with
in the next sub-section of the present chapter (see pp. 220-21). Furthermore,
we note that the basic objective of the studies of Voinescu et al. consisted not
entirely of the determination of the value $H_1^{(\text{word})}$ for the ordinary Rumanian


†Thus even Zipf himself had remarked that in some cases it is more appropriate to consider
that the frequency of the nth word is in fact proportional to $1/n^a$, where the constant a is close
to unity, but nevertheless not exactly unity (see in this context also [6], [71], [125] and [127]).
language, but of a comparison of the value of $H_1^{(\text{word})}$ associated with the speech
of healthy persons to the corresponding values associated with the speech of
another ten subjects, aphasia patients (i.e., those suffering from a speech disorder
caused by some brain disease). Hence it also borders on a study of the
statistical characteristics of different ‘specialized languages’, which we shall presently
consider.

The data on the entropy of one letter of text, of which we spoke above, related
as a rule to the ‘average literary language’, since literary texts mostly serve as
the experimental input for determining entropy. Thus, Shannon [159] (working
in collaboration with his wife Betty Shannon) and also Cover and King [86]
analyzed fragments from Dumas Malone’s book, Jefferson the Virginian.
Moreover, Kolmogorov and his associates used the works of Aksakov and
Goncharov (see pp. 200-201). But on p. 178 it has been indicated that the
occurrence frequencies of different letters may depend on the character of the
considered text; in exactly the same way, the values of the entropies $H_N$ or
redundancy R will be different for texts borrowed from different sources.
Moreover, any ‘specialized language’ (for example, a scientific or engineering text on
a specific problem, business correspondence, schoolboy slang, any unusual
jargon) will, as a rule, have more than average redundancy because the
number of words being used will be smaller and special terms and phrases will be
repeated often. This circumstance is of great advantage, since it greatly facilitates
very fast reading of special scientific literature by the experts and even the
reading of such literature in a poorly known language. Some slangs and scientific
jargons may be an exception in this connection, if they are used with the special
objective of decreasing the redundancy of language. By way of example, we
may mention thieves’ cant, in which long and meaningful phrases may
sometimes be replaced by extremely short expressions, or some recently devised
scientific jargons with enormously detailed terminology like those used in
mathematics by the French school of Nicolas Bourbaki.† An even more striking
example in this direction is provided by the symbolic language of modern
mathematical logic, characterized by the exceptional richness of sense.

A number of authors tried to investigate the influence of the nature of the text on the values
of the entropies of different orders per text letter and the text redundancy. Nevertheless, until
now only a few results of restricted reliability are available in relation to this problem.
For example, as already mentioned above, Manfrino’s calculations [128] of letter entropy values
for the Italian and Portuguese languages were carried out for three different types of text,
namely that from a scientific book, a history book or novel and a newspaper. However, only the
entropies of orders 1, 2 and 3 were considered by Manfrino and the values of $H_N$, where
N = 1, 2 and 3, obtained for these three different types of text turned out to be very close to
each other in the case of both the Italian and Portuguese languages.


More conclusive results were obtained by Newman and Waugh [138]. As already indicated,
these authors calculated the approximate values of $H_N$ for comparatively large N with the aid


†A more popular example is analyzed in [135], which has been already mentioned before
(p. 191).
of the method due to Newman and Gerstman [137], which is unfortunately not very
dependable. However, this made it possible for them to obtain crude estimates of all the entropies
$H_N$ up to the twelfth order (and the corresponding redundancy of the twelfth order $R_{12} =
1 - H_{12}/H_0$). These estimates of Newman and Waugh were derived by evaluating three
10,000-letter excerpts sampled from the English Bible, the writings of the American philosopher and
psychologist William James and the modern magazine, Atlantic Monthly. According to the results
obtained, the values of the entropies of a few lowest orders do not differ much for these three
different types of text, but those of higher order entropies and redundancy differ considerably.
The prose of the Bible, the simplest of the three samples, is characterized by the lowest value
of the entropy per letter and the highest redundancy (and also by the lowest value of the
average word length w; for the data on the value of w, see pages 187 and 196), while the highest
value of $H_N$ and w and the lowest value of R are attained for the most terse prose from the
Atlantic Monthly; the writings of William James range between the two other texts in all these
respects, though they are much closer to the Atlantic Monthly than the Bible.

Newman and Waugh’s method was slightly modified by Carterette and Jones [83] to study
the entropies of a few lowest orders and estimate the redundancy in children’s graded reading
books. Since the text difficulty must increase within a series of children’s graded readers, it is
natural to assume that the entropy values would increase and redundancy would decrease
with increase in the reader’s level. To verify this assumption, Carterette and Jones analyzed
children's reading texts at levels 1, 2, 3 and 5 and compared them with each other and with
the three types of adult texts which were investigated by Newman and Waugh. A 28-letter alphabet
was chosen in Carterette and Jones’ study by adding to the ‘customary’ 26 English letters the
period (the end of a sentence) and the space (the end of a word), but this circumstance is of
minor importance. The results obtained by them [83] are in good agreement with the
expectations: they show that the text redundancy decreases progressively from the First Reader to the
Atlantic Monthly, with the Bible being close to the Third Reader in this respect and the Fifth
Reader approaching William James’ writings intended for adults.

A mention has already been made of letter-guessing experiments carried out by Weltner 
[173] and Piotrovskii and his coworkers (see [143], [146], [147] and [148]) for various types of
texts. In particular, Weltner’s book contains the figure and the detailed table showing the esti- 
mates of the entropy (per text letter) for a great variety of texts including poems, a series of 
prose texts (short stories, novels by various authors), excerpts from two different newspapers, 
scientific texts in different fields and a number of usual and programmed textbooks. All esti- 
mates were deduced from the results of letter-guessing by the same group of subjects (students 
of a teacher's college), which demonstrated considerable differences between the entropies of 
different texts. However, Weltner did not attempt to clarify whether these differences were 
statistically significant in all the cases or stemmed from the experimental errors. 

Piotrovskii and his coworkers studied three different types of texts: conversational langu- 
age, literary texts (i.e., fiction) and various business texts (including engineering and scientific 
writings). The results (see [143], [146] and [147]) related to the information-theoretic charac- 
teristics of three types of texts in the Russian and French languages and the averaged results 
(for the language at large) are listed in the table† on p. 212. In complete agreement with what
has been stated above they show that the redundancy of ‘business texts’ is appreciably greater 
than both the ‘average redundancy’ and the redundancy of literary texts. However, the 
redundancy of conversational language is found to be slightly lower than the averaged 
redundancy—in principle, this may be due to the ‘liberty’ permissible in conversational langu- 
age which often leads to violations of strict constraints dictated by subtleties of ‘style’ and rules 


†In the references cited, there are some discrepancies between the values of redundancy R
and the entropy H. However, in the table on p. 212 the values of R have been brought in
conformity with the values of H taken from the same sources.
TABLE

                             $H = H_\infty$ (in bit/letter)     R (in per cent)
                             Russian      French                Russian      French
                             language     language              language     language

Language at large            1.37         1.40                  72.6         70.6
Conversational language      1.40         1.50                  72.0         68.4
Literary texts               1.19         1.38                  76.2         71.0
Various business texts       0.83         1.22                  83.4         74.4


of grammar. In [148], similar results are presented for seven languages (Russian, French,
English, German, Polish, Rumanian and Kazakh) and an attempt is also made to invoke some
rather crude procedures of mathematical statistics to test whether the redundancy differences
between various texts are real (statistically significant), or fictitious (generated by experimental
errors). These tests enabled the authors to conclude that the considerable difference obtained
between the redundancy of literary texts, or conversational language, and that of business texts
is statistically significant for all the studied languages, but the small difference obtained between
the redundancy of literary texts and that of conversational language is probably due to
experimental errors.
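
The redundancy columns of the table on p. 212 can be cross-checked against its entropy columns through R = 1 - H/H_0. The sketch below is illustrative: the alphabet sizes (32 ‘letters’ for Russian and 27 for French, both counted with the space) are assumptions, since they are not stated next to the table, and the function name is hypothetical.

```python
import math

# Entropy estimates H (bit/letter) from the table: (Russian, French).
table = {
    "Language at large":       (1.37, 1.40),
    "Conversational language": (1.40, 1.50),
    "Literary texts":          (1.19, 1.38),
    "Various business texts":  (0.83, 1.22),
}

# Assumed alphabet sizes (not stated next to the table): 32 'letters' for
# Russian and 27 for French, both counted with the space.
H0_RUSSIAN = math.log2(32)
H0_FRENCH = math.log2(27)

def redundancy_percent(h, h0):
    """R = 1 - H / H0, expressed in per cent."""
    return 100.0 * (1.0 - h / h0)

computed = {
    name: (redundancy_percent(h_ru, H0_RUSSIAN),
           redundancy_percent(h_fr, H0_FRENCH))
    for name, (h_ru, h_fr) in table.items()
}
for name, (r_ru, r_fr) in computed.items():
    print(f"{name:25s} {r_ru:5.1f} {r_fr:5.1f}")
```

Under these assumed alphabet sizes the recomputed percentages agree with the tabulated ones to within about 0.1, which is consistent with the remark that the values of R were brought into conformity with the values of H.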

The study by Smirnov and Yekimov [163] bears a more special character. These authors
investigated a sample from Russian telegraphic texts, the size of which was about 15,000 letters,
using Shannon’s letter-guessing method (and its variant due to Kolmogorov; see p. 198 et
seq.). The main result obtained by Smirnov and Yekimov is as follows: $H^{(\text{telegr. Russian})} \approx
1.4\,H^{(\text{literary Russian})}$. This result is obviously connected to the deliberate decrease in the
redundancy of telegraphic text (say, owing to the omission of conjunctions and other ‘evident’
words).

The other highly specialized language, namely, the so-called ‘control tower language’ of
radio communications between the air traffic controller at an airport and the aircraft pilot in the air,
has been studied by Frick and Sumby [97] as well as by Fritz and Grier [98]. The radio
communications considered in these works are naturally quite standard in their form and confined
to a few limited, constantly recurring topics. Hence, it is no wonder that the redundancy of
the corresponding language (estimated either by means of ‘guessing experiments’ or through
a direct statistical study of the collection of a few standard sentences of which these
communications are made up) is found to exceed considerably the redundancy of average ‘literary
language.’ In particular, by confining themselves further to a very restricted class of
messages transmitted by an airport ‘control tower operator’ to the pilot landing a plane, Frick
and Sumby obtained for the redundancy a value close to 96% (a similar redundancy
value, close to 93%, can be deduced from the results obtained by Fritz and Grier). The
abnormally large redundancy has here a completely transparent justification: because of the
difficulty in receiving the message (due to the aircraft noise), a reduction in redundancy may
lead to erroneous reception, threatening, in the considered case, disastrous (even tragic)
consequences. Hence the high redundancy is here necessary for air traffic safety.

The fact that a ‘specialized language’ is characterized by high redundancy is used when, 
for instance, one constructs specific codes for the business correspondence of a large firm. 
At present, such codes are developed with the indispensable participation of information
theory specialists, and the presence of many oft repeated standard words and sentences in the 
firm’s correspondence facilitates the increase of the code efficiency considerably. 
Let us now examine the interesting but so far little studied question of the
differences in the language redundancy of different literary texts. It can be
presumed that different literary genres are distinguished by different
redundancies, related to the style intrinsic specifically to this type of composition; it can
also be conceived that even within one literary composition the redundancies of
different fragments (dialogue, description etc.) are different. High redundancy may
characterize the hackneyed, stereotyped language of a literary composition, but
can also serve merely as evidence of the leisurely style of an author (thus, a
high redundancy is detected in the experiments mentioned on pp. 200-201 on
the letter entropy estimation in Goncharov’s Literary Party, written in a placid
and flowing language spelling out a large number of quite obvious details). Low
redundancy may bear testimony to the richness and brilliance (unexpectedness,
unconventionality) of a literary language (possibly, Faulkner’s language can be
cited as an example here); an extremely low redundancy in the language of a
literary composition is invariably interpreted as a deliberate complication of the
language, however (cf. Finnegans Wake by Joyce). Still lower redundancy
characterizes the sort of ‘obscurity’ that was used by the Russian poet Khlebnikov at
the turn of the century and became popular among a number of Western poets
after the Second World War (recall that zero redundancy characterizes the
‘sentence’ mentioned on p. 178, which can hardly be considered as a distinctively
‘nice’ literary form).

The allied problem here is that of comparing the redundancies of prose and 
poetic language, widely discussed in the sixties (see [116], [129], [165] and a 
number of papers in the collection [117]; see also the articles of Dolezel and of 
Nicolau, Sala and Roceric included in [128]). It is clear that the poetic form 
characterized by a specific rhythm and rhymes imposes on a language certain 
additional restrictions, i.e., raises its redundancy. An attempt can also be made 
to estimate quantitatively, say, the impact of the rhythm of a verse, by deter- 
mining the quantity of word combinations satisfying a given rhythmic plan, and 
comparing it with the entire store of meaningful word combinations (in the 
determination of such store, it is convenient to use as a base a dictionary of 
prose compositions by the same author).† The impact of rhymes is slightly more
intricate to calculate, but rough estimates are completely possible here, too. The
approximate estimates deduced by Kolmogorov for the classical Russian
tetrametric iambic verse (for example, Pushkin’s Evgenie Onegin written in this
verse)‡ show that the fulfilment of the requirements imposed on the poetic form


†See, for example, Kondratov [116], in which low-order entropy is calculated for
Russian poetry of a definite rhythmic plan and for Russian prose (scientific, business-like,
fiction, colloquial) texts (in bit/syllable); cf. also Lüdtke, “A comparison of metric plans
with respect to their redundancies” in [117].

‡The tetrametric iambic verse is characterized by a stanza, which theoretically consists of
eight uniformly alternating accented and unaccented syllables (in practice some accents are
sometimes shed).
reduces the ‘uncertainty’ $H_\infty$ for one text letter by a quite appreciable amount,
whose order compares with half the value of $H_\infty$ calculated for an ‘averaged
literary’ text. In fact, the corresponding letter-guessing experiments carried out
by Kolmogorov also show that, for a ‘poor’ verse (in which the decrease of
information contained in a letter is not compensated for by emotional stimulation,
brilliance of speech and richness of language characteristic of ‘good’ poetry),
the ‘limiting information’ $H_\infty$ per text letter is essentially less than
(approximately half of) the value of $H_\infty$ determined for classical Russian prose.† However,
in the compositions of many eminent poets, the decrease in the information
content of one text letter, related to the fulfilment of known formal rules, is
apparently compensated for to a great extent by the enhanced radiance and
unconventionality of language. Therefore, it can be well expected that here the
redundancy of the language has the same order as that of a prose literary text.

The impact of various factors related to literary style on the value of the entropy and
redundancy of language is considered by Paisley [140]. He made use of the method due to
Newman and Gerstman [137] and Newman and Waugh [138], which is not quite reliable but
enabled him to analyze 39 different English excerpts and compare the entropies among them.
The compared excerpts include: (a) two poetic translations from Homer’s Iliad due to different
authors; (b) four translations of two different passages from the same Iliad, and also four
(modern) translations of two passages from a chapter of Matthew's Gospel (in both cases the
selected passages differ considerably in content); (c) four prose and four poetic translations
from the Iliad, and (d) nine different translations from Matthew's Gospel relating to different
periods. In a number of cases analyzed by Paisley the differences among the entropy values
turned out to be noticeable, and some general regularities could also be noted here (such as
the progressive decrease of redundancy in literary texts as the time of their writing
approaches modern times). However, all these inferences are still not quite dependable
and call for further verification.

The studies [160], devoted to a number of Indian languages, are of a nature similar to those 
mentioned above. In these papers, the values of the entropy calculated for texts of different 
character (say, prose and poetic) and different times of writing are also listed. Some of 
the results obtained in [160] definitely have something in common with the results obtained 
by Paisley on material written in the English language. However, a comparison is rendered 
difficult here by the substantial difference between the English and Indian alphabets (see 
the discussion on pp. 197-198). 

Of the works more directly related to the comparison of statistical characteristics between 
prose and poetic language (a question not lost sight of in [140] and [160]), the foremost 
to mention are the investigations of Dolezel and of Nicolau, Sala and Roceric (see [128]) on an 
evaluation of the entropies of various orders for Czech and Rumanian prose and poetic language, 
and even for individual prose writers and poets. However, the preliminary estimates obtained 
by these authors clearly need further sharpening. Marcus [129] made a bold attempt to carry over 
to poetry the relationship between the physical concepts of ‘entropy’ and ‘energy’; on this basis 
he considered some results contained in the studies of Nicolau, Sala and Roceric concerning 
the calculation of the entropy for M. Eminescu’s compositions relating to various periods of 
the poet’s creative work. Tarnoczy’s paper [165] has a more special character; it contains 
the evaluation of a number of information-theoretic characteristics of Hungarian prose and 
poetry. 


†The short novel Duel by the Russian writer A. I. Kuprin was compared to a poem of quite 
poor quality printed on the reverse of one of the sheets of a tear-off calendar. 
Finally, let us note that well-justified doubts were and are still entertained about the very 
application to literary texts (unique by definition!) of conventional information-theoretic 
ideas arising in connection with purely applied problems of communication engineering. 
In fact, information theory ignores the question of the subject matter of the transmitted 
message and relies only on purely statistical concepts (e.g., on the concept of letter frequencies in 
a ‘statistical ensemble’ of all ‘average texts’ of a given language; but what can be made of 
the notion of a ‘statistical ensemble’ consisting of Shakespeare’s tragedies, or Pushkin’s 
poems?). These considerations led Kolmogorov (see [15]) to undertake a broad formulation 
of the problem of the possibility of various approaches to the very notion of ‘amount of 
information’ and to suggest a ‘purely combinatorial’ approach to this concept, in particular in 
application to a study of the language entropy and, especially, the entropy of literary texts. 

The essence of the combinatorial approach to the determination of entropy consists of the 
following. The Shannon entropy H per text letter can be determined from the condition 
that, for an n-letter alphabet, the number of different N-letter texts (where N is sufficiently 
large) satisfying the given statistical restrictions is not equal to the number 
$n^N = 2^{N \log n}$ ($= 2^{H_0 N}$), to which it would be equal if we had the right to choose any 
collection of N sequential letters, but is equal only to $M = 2^{HN}$ (see pp. 55-56 and 168-169). 
In accordance with this, by using the notion of ‘intelligible’ text, we can determine the entropy H as 


$$H_{\mathrm{comb}} = \lim_{N \to \infty} \frac{1}{N} \log M(N),$$


where M(N) is the number of all possible intelligible texts of length N. This definition no 
longer depends on any probability-theoretic concepts. 
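
As a small illustration of this definition (not taken from the text), consider a toy 'language' over the two-letter alphabet {a, b} in which a text counts as intelligible if it never contains two consecutive letters b. The rule, the function names and the chosen lengths are assumptions made purely for the example. The sketch counts M(N) and evaluates (1/N) log M(N) for finite N.

```python
from math import log2

def count_intelligible(n_letters: int) -> int:
    """Count texts of length n_letters over {a, b} without 'bb' (toy rule)."""
    # ends_a / ends_b: number of admissible texts ending in 'a' / in 'b'.
    ends_a, ends_b = 1, 1  # length 1: the texts "a" and "b"
    for _ in range(n_letters - 1):
        # 'a' may follow any letter; 'b' may only follow 'a'.
        ends_a, ends_b = ends_a + ends_b, ends_a
    return ends_a + ends_b

def combinatorial_entropy(n_letters: int) -> float:
    """Finite-N approximation (1/N) * log2 M(N), in bits per letter."""
    return log2(count_intelligible(n_letters)) / n_letters

for n in (10, 100, 1000):
    print(n, round(combinatorial_entropy(n), 4))
```

Unrestricted two-letter texts would give log 2 = 1 bit per letter; under the toy restriction the value settles near 0.69 bit per letter as N grows, without any probabilities being introduced.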

In striving to estimate numerically the value of the ‘combinatorial entropy’ $H_{\mathrm{comb}}$, the 
number M(N) may be estimated by means of the calculation of the number of possible extensions 
of the text. Specifically, suppose that * is a ‘blank’ word that contains no letters at all; 
further, denote by $I(*a_1 a_2 \ldots a_i)$ (or by $I(a_1 a_2 \ldots a_i)$, where $a_1, a_2, \ldots, a_i$ are some letters 
of the language under consideration) the number of all possible ‘intelligible one-letter 
extensions’ of a sequence of letters $a_1 a_2 \ldots a_i$, i.e., the number of such letters x that the fragment 
$a_1 a_2 \ldots a_i x$ can be extended up to an intelligible text. In this case, the value 


$$M(N) = I(*)\, I(*a_1)\, I(*a_1 a_2) \cdots I(*a_1 a_2 \ldots a_{N-1}),$$


averaged over a number of letter sequences, can be considered as an estimate of the quantity 
M(N) we are interested in. 
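
The extension-counting idea can be sketched as follows. The toy rule (texts over {a, b} with no two consecutive letters b), the function names and the sample size are all hypothetical. The way of averaging is also an assumption, since the text does not specify it: here each next letter is drawn with equal probability among the admissible extensions, in which case the arithmetic mean of the product of extension counts is an unbiased estimate of M(N).

```python
import random

ALPHABET = "ab"

def extensions(prefix: str) -> list:
    """Admissible one-letter extensions under the toy rule 'no two b in a row'."""
    # Under this rule every admissible fragment can always be continued,
    # so it is enough to check the single added letter.
    return [x for x in ALPHABET if not (prefix.endswith("b") and x == "b")]

def sampled_product(n_letters: int, rng: random.Random) -> int:
    """Grow one text letter by letter and return the product
    I(*) I(*a1) ... I(*a1...a_{N-1}) of extension counts along it."""
    text, prod = "", 1
    for _ in range(n_letters):
        options = extensions(text)
        prod *= len(options)
        text += rng.choice(options)
    return prod

def estimate_text_count(n_letters: int, n_samples: int, seed: int = 0) -> float:
    """Average the product over many randomly grown texts."""
    rng = random.Random(seed)
    total = sum(sampled_product(n_letters, rng) for _ in range(n_samples))
    return total / n_samples

print(estimate_text_count(10, 20000))
```

For this toy rule the exact count is M(10) = 144, so the printed estimate can be checked directly. For a real language the extension counts would have to come from a dictionary or from human judgment.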

What has been stated above paves the way for a purely combinatorial evaluation of the 
entropy and redundancy of a ‘grammatically correct’ text. The earliest efforts in this direction 
are due to Kolmogorov and his coworkers (see the first paper of [15]), in which the number 
of possible text extensions is determined with the aid of the list of words entered in S. I. 
Ozhegov’s Dictionary of the Russian Language. The estimate H = (1.9 ± 0.1) bit/letter 
obtained here naturally appreciably exceeds the bound on the entropy of ‘literary texts’ indicated 
on p. 201 (since the ‘degree of uncertainty’ of a literary text letter is by no means bounded 
only by the requirements of grammatical correctness). Unfortunately, a more detailed exposition 
of these investigations, as well as the results of similar studies commenced in Leningrad 
by R. A. Zaidman, has not yet been published. 


4.3.2. Spoken Language 


We now pass on to the problem of the entropy and information contained 
in spoken language, already touched upon on pp. 211-212. It is natural to 
think that all statistical characteristics of such speech depend considerably more 
on the choice of the speakers and on the character of their talk than is observed 
in the case of written language; in fact, written language is, as a rule, more 
‘uniform’ than spoken language. Though according to the data due to Piotrovskii 
and his colleagues, ‘on the average’ the entropy of spoken language is 
slightly higher than the entropy of written texts, this is undoubtedly not so for 
certain types of speech (cf., say, the example of ‘control tower language’ on p. 212). 
The lower value of the entropy of speech can be explained by the fact that in 
conversation a few words are often repeated many times (there being little concern 
about the ‘elegance of style’) and also many ‘superfluous’ words (i.e., words having 
no information content) are frequently added; this happens both to facilitate 
the understanding of speech and simply to allow the speaker some time to 
think about what he desires to say next. In particular, the redundancy of 
speech is very large when the level of noise is high (say, in the humming of an 
airplane, in a compartment of an electric train or subway), and also for conversations 
between drunkards, persistently repeating one and the same words and 
expressions (as a rule, not very sophisticated ones); the latter is explained by 
the fact that in this case not only the understanding but also the pronunciation 
of speech is difficult. 

By determining the average number of letters pronounced per unit of time, it 
is possible to estimate approximately the amount of information conveyed during 
a conversation in 1 sec; usually it is of the order of 10 bits (this amount of 
information naturally depends strongly on the ‘conversation speed’, which can 
vary quite significantly: ‘very rapid’ speech is almost five times faster than 
‘very slow’ speech).† This is in agreement with physiological acoustic data, 
which enable us to estimate the total number of ‘distinguishable sounds’ 
pronounced by a person in unit time (see Miller [131]). 

However, this estimate of the information transmission rate for a conversation 
takes account of only the ‘semantic’ information, which is related to the meaning 
of the speech and can also be extracted from a written record of the spoken 
words. In fact, real speech always contains, in addition to this, further, sufficiently 
significant supplementary information, which the speaker communicates 
sometimes voluntarily but sometimes also directly contrary to his own desire; 
this supplementary information may even contradict the ‘semantic information’, 
but in such cases it deserves, as a rule, greater confidence. Thus, from a 
conversation we can judge the temper of a speaker and his attitude to what 
has been stated; we can recognize the speaker, even if it is not indicated 
to us by any other source of information (including here also the ‘meaning of 


†We are obviously not speaking here of conversation with exceptionally high redundancy, 
of the sort discussed above; thus, in the case of exchanges between the pilot and the air controller 
at an airport, the information transmission rate does not exceed 0.2 bit/sec, i.e., it is much 
smaller than that for extremely slow conversation on general topics. 
speech’); in many cases we can determine the birthplace of a person unknown 
to us by his pronunciation (the latter factor plays an important role in the opening 
act of Bernard Shaw’s play Pygmalion); we can evaluate the loudness of 
speech, which in the case of voice transmission through a communication channel 
(telephone, radio) is determined in the main purely by technical characteristics 
of the transmission channel, etc. A quantitative evaluation of all this information 
is a highly complex problem, which demands a considerably deeper knowledge 
of language than that available at the present time; in particular, this 
requires vast statistical data of a great variety, which are almost completely lacking 
so far. 

An exception in this respect is the comparatively restricted problem of the 
so-called ‘insistence stresses’ emphasizing individual words in a sentence. These 
stresses also carry a definite information load, which (for the particular case of 
telephone conversations in English) can be estimated quantitatively. The statistical 
data required for this were obtained by Berry [76], who analyzed a number 
of ‘typical English telephone conversations’. His data show, in particular, 
that the stress is usually put on the most rarely used words (that, however, is 
quite natural, since it is clear that hardly anyone will put stress on the most 
common words, say, prepositions, articles, or conjunctions). If we denote by $q_i$ 
the probability of finding a definite word $W_i$ stressed, then the average information 
contained in knowing whether that word is or is not stressed is given by 


$$-q_i \log q_i - (1 - q_i) \log (1 - q_i).$$


Suppose now that $p_1, p_2, \ldots, p_K$ are the probabilities (frequencies) of all words 
$W_1, W_2, \ldots, W_K$ (here K is the total number of all words used; the probabilities 
$p_1, p_2, \ldots, p_K$, playing a basic role throughout language statistics, may 
be found in the so-called ‘frequency dictionaries’, see pp. 186 and 207). In such 
a case, for the average information H contained in the insistence stress, we can 
set up the formula 


$$H = p_1[-q_1 \log q_1 - (1 - q_1) \log (1 - q_1)] 
+ p_2[-q_2 \log q_2 - (1 - q_2) \log (1 - q_2)] 
+ \cdots + p_K[-q_K \log q_K - (1 - q_K) \log (1 - q_K)].$$


By substituting here Berry’s data, Mandelbrot [126] calculated that the average 
information, which we obtain by ascertaining on which words the insistence 
stresses are put, is approximately of the order of 0.65 bit/word in the case of 
the English language. 
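
The formula for the information carried by insistence stress is a frequency-weighted sum of binary entropies and is easy to evaluate once the word frequencies and stress probabilities are known. The sketch below uses a hypothetical three-word vocabulary with made-up values; it does not reproduce Berry's data or Mandelbrot's result, and the function names are illustrative.

```python
from math import log2

def binary_entropy(q: float) -> float:
    """-q log2 q - (1 - q) log2 (1 - q), with 0 log 0 taken as 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def stress_information(word_probs, stress_probs) -> float:
    """Average information per word: sum over words of p_i * h(q_i)."""
    return sum(p * binary_entropy(q) for p, q in zip(word_probs, stress_probs))

# Hypothetical vocabulary (NOT Berry's data): one very common word that is
# almost never stressed, and two rarer words that are stressed more often.
p = [0.6, 0.3, 0.1]   # word frequencies p_1, p_2, p_3
q = [0.01, 0.3, 0.5]  # stress probabilities q_1, q_2, q_3
print(round(stress_information(p, q), 3))
```

Words that are almost never (or almost always) stressed contribute very little, so with these made-up numbers most of the total comes from the rarer words whose stress is genuinely uncertain.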


†This calculation was set forth in Mandelbrot’s paper presented at the Third Symposium 
on Information Theory held in London in 1955. The paper was withdrawn from the Symposium 
proceedings, but included in the Russian translation of these proceedings [126]. 
Mandelbrot’s related paper in English (see [126], second paper) contains (on p. 77) the simplest form 
of Berry’s law yielding the stated calculation but does not spell out the calculation itself. 
As to the generally diverse ‘unsemantic’ information contained in speech, the 
existing data allow us to give only a quite rough and incomplete estimate of its 
total quantity. Such an estimate was obtained by the German scientist Küpfmüller 
in his interesting study [118] of spoken and written German, 
which has already been referred to in the foregoing. Küpfmüller did not even 
attempt to take account of the intricate statistical regularities of intonation, 
tone of voice, and other peculiarities of speech. His work is essentially 
restricted to an estimation of the ‘zero-order entropy’ $H_0$ related to the number 
of different possibilities, which is then offered as a rough guide on the assumption 
that the corresponding redundancy is equal to 50%. Together with the 
information given by intonation, Küpfmüller has estimated separately the information 
connected with the individual characteristics of the voice of a speaker and 
has also evaluated the information conveyed by the loudness of the speech; the 
sum of the three quantities obtained here has been compared with the ‘semantic’ 
information contained in the same speech. For an evaluation of the total number 
of identifiable degrees of loudness and the total number of ‘speech melodies’ 
(the types of intonation determined by a small variation of the basic frequency 
of sound oscillations), physiological acoustic† data have been used; the total 
number of individual voices discernible by a person is roughly determined, so to 
say, ‘by eye’. It is natural that Küpfmüller’s estimates of the ‘total number of 
possible outcomes’ obtained in this way cannot make any claim to high precision; 
however, since the information is determined by the logarithm of this 
number, even a rough estimate enables us to calculate the amount of information 
to quite reasonable accuracy (clearly, when the total number of outcomes is of 
the order of 1000, then, for the information to be overestimated twofold, it is 
necessary that this number of possibilities be increased 1000 times!). These 
calculations led Küpfmüller to conclude that the supplementary information 
contained in the intonation, loudness, and peculiarities of individual voices in 
normal conversation must not be greater than 75% of the ‘semantic’ information; 
in quite rapid and extremely slow speech it forms, respectively, not more than 
30% and 150% of the ‘semantic’ information. (The substantial difference between 
the three values may be explained partially by the fact that in rapid speech 
different voices are considerably less discernible and different intonations are 
much less distinguishable.)†† 
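
The remark about the logarithm can be checked with a few lines of arithmetic. The sketch assumes equally likely outcomes, so that the information is simply the binary logarithm of their number; the function name is illustrative.

```python
from math import log2

def information_bits(n_outcomes: float) -> float:
    """Information (in bits) of a choice among n equally likely outcomes."""
    return log2(n_outcomes)

base = information_bits(1_000)         # about 9.97 bits
doubled = information_bits(1_000_000)  # twice as much: 1000 times more outcomes
print(round(base, 2), round(doubled, 2))

# Even a tenfold error in the number of outcomes shifts the estimate
# by only log2(10) bits.
print(round(information_bits(10_000) - base, 2))
```

An estimate of the number of outcomes that is wrong by a factor of ten therefore changes the information by about 3.3 bits, roughly a third of the value itself when the true number is of the order of 1000.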


†Seemingly, the loudness and intonation may be varied in a continuous manner, so that 
infinitely many different possibilities must be available here. In reality, however, the human 
ear distinguishes only a finite number of different degrees of loudness and a finite number of 
intonations; we shall have more to say about this in detail below (see Sec. 4.3.4). 

††Apparently, this position is due to the fact that the nerve channels leading from the 
hearing organs to the brain may transmit during a unit of time only a fixed amount of information 
(see pp. 249-251). Hence an increase in the ‘semantic’ information transmission rate 
invariably implies a decrease in the transmission rate of other types of information over the 
same channel. 
In Küpfmüller’s work the values of the ‘specific’ entropy and information of 
speech related to one pronounced letter are also given. In fact, however, these 
values have only a conditional character (they are needed just for a comparison 
of speech with written language). Indeed, during a conversation individual letters 
are never uttered; only sounds are pronounced, which differ substantially 
from letters. Hence it is necessary to regard an individual sound, a phoneme, 
as the basic element of speech (in the same sense that a letter is the basic element 
of written language). Meaningful speech is made up of phonemes in exactly 
the same way that meaningful written language is composed of letters. 
Hence, in the transmission of speech over a communication channel we have 
only to ensure that all phonemes are transmitted correctly. If this is achieved, 
then the meaning of the entire speech will also be conveyed correctly, i.e., no 
part of the ‘semantic’ information will be missed. This leads to the result that 
in all cases when we are interested only in the transmission of the ‘semantic’ 
information of speech (as is so in the majority of cases), our concern is primarily 
focused not on the entropy and information of a ‘pronounced letter’ (which is a 
purely conventional notion), but on the entropy and information of one actually 
pronounced phoneme. 

The list of phonemes for a given language is obviously not identical with the 
list of alphabet letters. The total number of phonemes considerably exceeds 
the number of letters, since one and the same letter can be sounded differently 
in different cases (for example, the pronunciation of a vowel depends substantially 
on whether or not it is accented; one and the same consonant can be pronounced 
with hard and soft sounds, and so on). It is necessary to bear in mind 
here that, while different viewpoints may be possible even in relation to the number 
of alphabet letters (cf., for example, the footnote†† on p. 192 related to European 
alphabets using Latin letters and the discussions on pp. 194 and 203 on the 
Russian ‘telegraph alphabet’ and the Hungarian alphabet, respectively), with 
respect to a ‘phoneme alphabet’, on the very definition of which (see, 
for example, Cherry [6] or Uspenski [170]) there is so far no consensus among 
linguists, differences between various authors are inescapable. Some preliminary 
results about the phoneme statistics and phoneme entropies of the 
English spoken language have been obtained by Black and Denes (see [77]). 
The former calculated the entropies Hj, H, and H, for one phoneme by statist- 
ical data related to a collection of one- and two-syllable English words (which 
obviously still does not characterize the entire English language), the number of 
phonemes considered being 41. The latter author determined the relative fre- 
quencies of phonemes and all their pair combinations (phonemic ‘digrams’) by 
the data related to an ‘average English language’, and by taking the number of 
phonemes as 45 (the entropy H, of one phoneme digram, as it follows from 
the data due to Denes, is given in [93]). Similar statistical results on the pho- 
nemes and phonemic digrams of French language were published by Haton 
and Lamotte [106]. The German scientist Endres [93] made an effort to 
evaluate approximately the total redundancy of one phoneme of German and 
English speech by using a spectrogram of phonemes (giving the representation 
of a phoneme in the form of a figure on a plane) and then applying rough 
methods to determine the redundancy of the plane figures, allied to those used in 
the concluding portion of [112] (which will be further elaborated on pp. 239-241) 
for an estimation of the redundancy of the letter figures in a typescript text. 
According to his data for both languages, the redundancy of phonemes is close 
to 80-85% (that is, it does not differ much from the redundancy of the letters 
of a written language). The American scientists Cherry, Halle and Jakobson 
[84], who also made use of the findings of a number of Russian linguists, selected 
42 different phonemes in the Russian language. They calculated the frequencies 
of individual phonemes (and also various phonemic ‘digrams’ and ‘trigrams’) 
by using mainly quite obsolete and incomplete data given by the well-known 
Russian philologist A. M. Peshkovskii [142]. Starting from these data, they 
determined the values of the ‘maximum possible entropy’ $H_0 = \log 42$ for one 
phoneme, the first-order entropy $H_1 = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_{42} \log p_{42}$ 
(where $p_1, p_2, \ldots, p_{42}$ are relative frequencies of different phonemes), and the 
second- and third-order ‘conditional entropies’ $H_2$ and $H_3$ (defined in exactly 
the same way as for the written language). The results obtained (in bits) [84] 
are listed in the accompanying table. 


$H_0$                $H_1$     $H_2$     $H_3$ 
log 42 = 5.38        4.77      3.62      0.70 


It is instructive to compare these values with the values of the letter entropies 
$H_0$, $H_1$, $H_2$ and $H_3$ given on p. 194 for the Russian written language (see also 
p. 196). The comparison shows that, if only the data in [84] are justified,†† 
the decrease of the successive conditional entropies for phonemes takes 
place appreciably more rapidly than in the case of the written text letters. 
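
The speed of this decrease is perhaps easiest to see by converting the tabulated values into the redundancy implied by each order, using the form R = 1 - H/H_0. The sketch below takes the four numbers exactly as listed from [84]; the variable and function names are illustrative.

```python
# Entropy values (bits per phoneme) as listed in the table from [84].
entropies = {"H0": 5.38, "H1": 4.77, "H2": 3.62, "H3": 0.70}

def redundancy(h: float, h0: float) -> float:
    """Redundancy implied by an entropy estimate h: R = 1 - h / h0."""
    return 1 - h / h0

for name in ("H1", "H2", "H3"):
    print(name, round(redundancy(entropies[name], entropies["H0"]), 2))
```

With these values the implied redundancy rises from about 0.11 at first order to about 0.33 at second order and about 0.87 at third order. The last jump is so steep that it may reflect the limited statistical material rather than the language itself.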

A study of low-order entropies in Rumanian speech (and a comparison of 
the data obtained with those related to the written language) has been carried 
out by Fradis, Mihailescu and Voinescu [96]. Let us finally mention the papers 
of Voinescu, Fradis and Mihailescu [172], which are devoted to a comparison 
of the information-theoretic characteristics (the entropies $H_1$ and $H_2$ for one 


†A considerably more extensive study of the frequencies of individual phonemes and their 
pair combinations (based on far more extensive modern material) has been carried out in the 
Department of Phonetics of Leningrad University (see Zinder [178]). In this investigation the 
total number of phonemes is taken as 48 (primarily at the expense of a more detailed 
demarcation of vowel sounds). 

††Unfortunately, [84] does not give an indication of the exact volume of the material used 
for the frequency determination of different phonemes and their binary and ternary combinations. 
Hence, it may be feared that the value determined for $H_3$ is strongly understated 
because of the insufficiency of statistical data (cf. the footnote on p. 228). 
phoneme, the difference $H_1 - H_2$, and also the entropy $H^{(\mathrm{word})}$; see p. 209) of 
the speech of healthy persons and that of aphasic persons (i.e., those suffering 
from some brain disorder affecting the speech). It turned out here that the 
entropies $H_1$, $H_2$ and $H^{(\mathrm{word})}$ all assume appreciably lower values for the speech 
of an aphasic patient than for that of a healthy person (i.e., the redundancy of 
speech is increased here considerably) and, in addition, the stated entropies, as 
a rule, also differ more sharply for different aphasic persons than for different 
healthy persons (the especially sharp character of the indicated phenomenon was 
observed in application to the quantity $H^{(\mathrm{word})}$, which essentially depends on the 
size of the speaker’s vocabulary and the extent of uniformity with which words 
from this vocabulary are used by him). 

With the aid of the arguments employed in the foregoing for the determination 
of the redundancy $R^{(\mathrm{word})}$, the relation between the redundancies of spoken and 
written languages can also be established. The fact that speech can be written 
down and written language can be spoken implies that the ‘total information’ 
contained in a specified text† does not depend on the form, whether spoken or 
written, in which this text is presented. Hence 


$$H^{(\mathrm{letter})} \times \text{number of letters} = H^{(\mathrm{phoneme})} \times \text{number of phonemes}$$


(see p. 207). Consequently, it follows that 


$$H^{(\mathrm{phoneme})} = H^{(\mathrm{letter})} \times \omega,$$


where ω is the average number of letters per phoneme (‘the average phoneme 
length’). The quantity ω is an important statistical characteristic of a language, 
which connects the spoken and written languages. From the preceding formula 
it also follows that (cf. pp. 206 and 208) 


$$\frac{H^{(\mathrm{phoneme})}}{H_0^{(\mathrm{phoneme})}} = \frac{H^{(\mathrm{letter})}}{H_0^{(\mathrm{letter})}} \times \omega \times \frac{\log n}{\log k},$$


or 


$$1 - R^{(\mathrm{phoneme})} = \left(1 - R^{(\mathrm{letter})}\right) \times \omega \times \frac{\log n}{\log k},$$


where k is the total number of phonemes, and n is the number of letters (so that 
$H_0^{(\mathrm{phoneme})} = \log k$ and $H_0^{(\mathrm{letter})} = \log n$); here it 
is natural to take $R^{(\mathrm{without\ space})}$ for $R^{(\mathrm{letter})}$. However, the difficulty encountered 
in the use of this equation is the absence of statistical data which could 
permit the determination of the quantity ω (even with regard to the number 

†Apparently, in the case of speech only the ‘semantic’ information contained in it is 
considered (see p. 216). 
of phonemes, there is so far no consensus of opinion among philologists).† 
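
The relation between the two redundancies is simple to evaluate numerically. The sketch below uses the rough English values quoted in the footnote to this passage (ω ≈ 1.2, 43 phonetic symbols, 26 letters); the letter redundancy of 0.7 fed into the function is a purely hypothetical input, and the function name is illustrative.

```python
from math import log2

def phoneme_redundancy(r_letter: float, omega: float,
                       n_letters: int, k_phonemes: int) -> float:
    """Solve (1 - R_phoneme) = (1 - R_letter) * omega * log n / log k."""
    factor = omega * log2(n_letters) / log2(k_phonemes)
    return 1 - (1 - r_letter) * factor

# Conversion factor for the footnote values: omega = 1.2, n = 26, k = 43.
factor = 1.2 * log2(26) / log2(43)
print(round(factor, 2))  # 1.04

# Hypothetical letter redundancy of 0.7 (an assumed input, not a measured value).
print(round(phoneme_redundancy(0.7, 1.2, 26, 43), 3))
```

Because the factor is so close to 1, the redundancy per phoneme comes out almost equal to the redundancy per letter for these values. The base of the logarithm is immaterial, since only the ratio log n / log k enters.
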
4.3.3. Music 


A study of the same sort can also be carried out with respect to musical 
messages. It is natural to think that there is a quite strong bond among the 
sequential sounds (i.e., sequential note symbols) of a given melody. Some note 
sequences, being more melodious than others, occur more frequently in musical 
compositions. If we write out a number of notes at random, 
then the information contained in each note of this entry will be the largest; 
however, from the viewpoint of music such a chaotic sequence of notes will be 
of no value. In order to obtain a tune pleasing to the ear, it is obviously 
necessary to introduce a definite redundancy into our sequence. It may, however, be 
feared in this connection that, in case the redundancy is too large, so that 
the succeeding notes are defined almost uniquely by the preceding ones, we 
obtain a most monotonous and uninteresting piece of music. What, then, is the 
redundancy under which ‘pleasing’ music can be obtained? 

It is highly likely that the redundancy of simple tunes is of the same order as 
the redundancy of intelligible speech. It would be of great interest to study 
quantitatively the redundancy of various forms of musical compositions or 
compositions by various composers. Unfortunately, at present we have very 
little concrete data of this sort. One of the earliest results in this direction was 
obtained in 1956 by Pinkerton [145], who analyzed from the standpoint of information 
theory an album of popular American nursery rhymes. For simplicity 
it was assumed in this work that all sounds are within the range of one octave; 
furthermore, since the so-called chromatic scales do not occur in the considered 
tunes, all these tunes may be reduced to seven basic sounds: do, re, mi, fa, sol, 
la and si (which correspond to the white keys on a piano). All the analyzed 
songs were set up as a sequence of ‘basic elements’, each lasting 
one beat (an eighth note). To the seven notes of an octave an additional eighth 
‘basic element’ O was added for signifying rest or holding of a note for more 
than one beat. Thus, the ‘maximum possible entropy’ $H_0$ of one note is here 
given by 


$$H_0 = \log 8 = 3 \text{ bits}.$$


By calculating the frequencies (probabilities) of individual notes in all 39 


†By associating English phonemes with the 43 phonetic symbols used in the Anglo-Russian 
dictionaries widely prevalent in the USSR, we can determine approximately the ‘average 
phoneme length’ ω by a comparison of the length of English words written in letters and of their 
phonetic transcriptions. Then we obtain ω ≈ 1.2, yielding 


$$1 - R^{(\mathrm{phoneme})} = \left(1 - R^{(\mathrm{letter})}\right) \times 1.2 \times \frac{\log 26}{\log 43} \approx 1.04\left(1 - R^{(\mathrm{letter})}\right).$$
analyzed tunes, Pinkerton found that 


$$H_1 = -p(O) \log p(O) - p(\mathrm{do}) \log p(\mathrm{do}) - p(\mathrm{re}) \log p(\mathrm{re}) 
- p(\mathrm{mi}) \log p(\mathrm{mi}) - p(\mathrm{fa}) \log p(\mathrm{fa}) 
- p(\mathrm{sol}) \log p(\mathrm{sol}) - p(\mathrm{la}) \log p(\mathrm{la}) 
- p(\mathrm{si}) \log p(\mathrm{si}) = 2.73 \text{ bits};$$


here, for example, p(do) denotes the probability of the note do. By applying 
the probabilities for combinations of two notes determined by Pinkerton, the 
conditional entropy $H_2$ can also be calculated; it turns out to be close to 2.42 
bits. (However, let us note that Pinkerton’s paper contains only somewhat 
averaged values of two-note combination probabilities, so that the obtained 
value of $H_2$ is overstated.) It is clear that by means of the values of $H_1$ and 
$H_2$ alone there is very little that can be stated about the degree of redundancy 
of the considered melodies (it can only be said that obviously it is appreciably 
higher than 1 - (2.42/3) ≈ 0.2). Some indirect data that verify this conclusion 
are given below. 
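
The bound quoted here follows because the limiting entropy per note cannot exceed $H_2$, so that the redundancy is at least 1 - $H_2$/$H_0$. The sketch below recomputes $H_0$ and this bound from the figures in the text and, for comparison, evaluates the first-order entropy formula for a hypothetical set of note frequencies (Pinkerton's actual frequencies are not reproduced in the text). Function and variable names are illustrative.

```python
from math import log2

def entropy(probs) -> float:
    """First-order entropy -sum p log2 p in bits (zero terms are skipped)."""
    return -sum(p * log2(p) for p in probs if p > 0)

h0 = log2(8)                    # eight basic elements: 3 bits
uniform = entropy([1 / 8] * 8)  # equally likely elements reach h0

# Hypothetical, non-uniform frequencies for O, do, re, mi, fa, sol, la, si
# (NOT Pinkerton's data); any departure from uniformity lowers the entropy.
skewed = entropy([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05])

# Lower bound on the redundancy implied by H2 = 2.42 bits.
redundancy_bound = 1 - 2.42 / h0
print(round(uniform, 2), round(skewed, 2), round(redundancy_bound, 2))
```

The bound evaluates to about 0.19, which the text rounds to 0.2.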

Even before the appearance of Pinkerton’s paper, the work of F. and C. 
Attneave, calculating the frequencies of individual notes and two-note combinations 
in a number of American cowboy songs, was reported at the Conference 
on Information Theory held in London in 1955. A considerably more detailed 
study of this sort was accomplished in 1957 at the Computer Department of 
Harvard University (see Brooks et al. [80]). Here excerpts from 37 hymn 
tunes of different composers and periods of origin, having the same metric 
structure, were analyzed. The use of a high-speed electronic computer enabled 
the authors to dispense with the simplification that consists of referring all notes 
to one and the same octave; the distinct ‘basic elements’ considered here were 
all the notes of the four octaves in the chromatic scale (including also the five 
intervening sounds, corresponding to the black keys of the piano). Thus, Brooks 
et al. considered in all 49 distinct elements, and they also included here the 
special notations for sounds extended from the preceding time interval. The 
unit of duration of one basic element was again chosen as an eighth note, since 
shorter notes were not encountered in any of the considered hymns. 

Brooks et al. calculated with the aid of modern computers the frequencies of 
all individual ‘basic elements’ and all combinations of two, three, ..., eight 
such adjoining elements. The results obtained yield in principle the possibility 
of setting up approximate expressions for all conditional entropies from $H_0$, 
$H_1$, $H_2$ up to $H_8$ inclusive. True, it is necessary to bear in mind in this connection 
that the statistical material used (consisting of 37 small excerpts from 
different hymns) is surely inadequate for obtaining any reliable estimate of the 
probabilities of combinations of a large number of notes; hence, the values of 
higher-order entropies (the entropies $H_6$, $H_7$ and $H_8$, at any rate) determined 
in this way have little validity. Nevertheless, the values of the first few conditional 
entropies may be of definite interest; hence we can only regret that the authors 
did not give in [80] the results of the corresponding calculations (and they 
also adduced no data that would permit an estimate of the corresponding 
entropies). 

A similar analysis of the melodies of the noted American composer Stephen 
Foster (1826-1864) was carried out (although on a modest scale) by Olson and 
Belar [139]. These authors considered the 11 most popular songs of Foster, 
and by putting the musical scale on the basis of 12 different notes (covering 
one and a half octaves), they calculated the frequencies (i.e., the empirical values 
of the probabilities) of each individual note and all possible groups of two and 
three succeeding notes. It is clear that starting from the data obtained it is 
possible to estimate without difficulty the entropies $H_0$ and $H_1$ and even the 
conditional entropies $H_2$ and $H_3$ for one note in Foster’s songs (although this 
was not accomplished in [139] either). Further information on the studies of 
musical composition statistics may be found in Zaripov’s book [177], which also 
contains an extensive bibliography. 

Examples of direct valuations of the information-theoretic characteristics of 
different musical compositions are available in the papers of Youngblood [176], 
Cohen [85], Siromoney and Rajagopalan [162], Hiller and Beauchamp [107], 
Roland [153] and some others (see also a review of this topic given in Chapter 13 
of [17]). Thus, for example, in [85] (in which the results due to Youngblood 
and Brawly are also used) the values of the entropies $H_1$ and $H_2$ and the corresponding 
redundancies $R_1 = 1 - (H_1/\log n)$ and $R_2 = 1 - (H_2/\log n)$ of the first 
two orders, related to one note, are calculated and compared among themselves 
for the nineteenth century musical material of individual romantic composers 
(Schubert, Mendelssohn, Schumann) and the German romantic music at large, 
and also for a collection of Catholic religious hymns and modern American 
rock and roll music. In [153] the values of redundancies for the classical music 
of Haydn and modern music of Schönberg are compared (it is natural that in 
Schönberg’s music the redundancy is found to be less than that in Haydn’s 
music). In [107] some results are given of the analysis of one of the compositions 
due to Webern, which is similar to Schönberg’s compositions, and in 
[162] the values of $H_1$ are calculated for a number of compositions of South 
Indian (Karnatic) music of the eighteenth and nineteenth centuries. In [85] 
and [107] are adduced also some data with respect to the ‘rhythmic redundancy’ 
of different musical compositions (similar to the redundancy of ‘metric rhythm’ 
in verse). However, so far all the estimates obtained concerning the informa- 
tion characteristics of musical compositions ought to be considered to be prelim- 
inary and the methods used for their calculation still require deeper investiga- 
tions (this is emphasized particularly in the concluding part of the paper [85]). 

Note also that the basic objective of statistical probability calculations 
describing musical structure in many cases does not at all consist of the 
determination of the entropy and redundancy. The fact is that the high degree of 
redundancy to be found in high class music compositions allows us to give an 
entirely different, quite unexpected, application of statistical tables, which define 
the probabilities and conditional probabilities for different notes. In order to 
approach this application, recall the ‘approximations of different orders’ of an 
English sentence presented on pp. 178 and 181-184, i.e., the sequences of 
English letters, in which to a greater or lesser extent the intrinsic connection 
existing in the English language between adjoining letters was taken into account. 
It was seen that the farther we extended those relations, which are taken note 
of in the composition of our sentences, the ‘more English’ these sentences 
became, i.e., they tended to become closer in sound to the ordinary English 
language. However, one can obviously hardly expect to obtain in this way 
completely meaningful expressions, since there always exists some element of 
randomness in our sentences, which confuses their sense. Let us now attempt 
to apply the same methods to music. Here we shall obtain ‘musical sentences’ 
(i.e., the sequences of notes), all increasingly closer in their statistical structure 
to those sources which are used for the calculation of frequencies of different 
notes and their combinations. As in the case of ‘models of English sentences’, 
these new ‘musical sentences’ clearly shall not exactly repeat any of the sequences 
from a sample used for the calculation of frequencies. However, whereas in 
the case of language this situation makes our ‘sentences’ senseless, in the case 
of music it is expressly this which makes them remarkable; in fact, they represent 
new, original musical compositions! 

Apparently, it is difficult to announce in advance the extent to which such 
‘models of musical melodies’ may be of interest; it is also not clear to what 
extent statistical relations ought to be taken into account for obtaining compositions 
close ‘in spirit’ to the original material (i.e., for example, whether to 
imitate the compositions of a specific genre or of a particular author). It is, 
however, essential to note that by virtue of the appreciable redundancy of music 
we can arrive at sufficiently harmonious sounds via one of the earliest steps of 
the process described on p. 178 et seq. This was also shown convincingly in 
the earlier purely amateur experiments of Pinkerton [145]. In these experiments, 
account was taken only of the probabilities of individual notes and two-note 
combinations, which were furthermore only roughly approximated. In particular, 
all the probabilities were rounded off to convenient fractions so that the 
choice of the next note could be made every time by drawing a card from a small 
collection of playing cards. Moreover, a simpler and cruder note-guessing 
procedure was suggested by Pinkerton which reduced all the probabilistic choices 
to a series of binary choices, so that at each step selection could be made simply 
by flipping a coin. Besides, by imposing auxiliary relations that assure the 
conservation of a definite rhythm of ‘musical sentences’ Pinkerton could obtain 
several new tunes which, according to the author’s assertion, are sometimes not 
inferior to the original nursery tunes from the album used by him. The nota- 
tion of one such ‘randomly obtained’ tune is given below (cf. [145], p. 84): 

The redundancy of this tune can be calculated with comparative ease by 
starting from the statistical laws used for obtaining it; it is found to exceed 63%. 
In the words of Pinkerton, “this tune is highly monotonous, but nevertheless, 
less monotonous than some actual nursery tunes.” Hence it can be inferred 
that in actual nursery tunes the redundancy is probably of the same order. 
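
The card-drawing procedure can be imitated by a short program. What follows is a minimal sketch under stated assumptions, not Pinkerton's actual tables: the three sample tunes and the function names are hypothetical, rhythm is ignored, and each next note is simply drawn with a probability proportional to how often it followed the current note in the sample.

```python
import random
from collections import Counter, defaultdict

def transition_table(melodies):
    """Count how often each note follows each other note in the sample tunes."""
    table = defaultdict(Counter)
    for tune in melodies:
        for prev, nxt in zip(tune, tune[1:]):
            table[prev][nxt] += 1
    return table

def compose(table, start, length, rng):
    """Draw a 'musical sentence' note by note from the two-note frequencies."""
    tune = [start]
    while len(tune) < length:
        followers = table.get(tune[-1])
        if not followers:  # dead end: this note was never followed by anything
            break
        notes = list(followers)
        weights = [followers[note] for note in notes]
        tune.append(rng.choices(notes, weights=weights, k=1)[0])
    return tune

# Hypothetical sample of 'nursery tunes', one letter per note.
sample = ["CDEFGFED", "CEGECEGC", "GFEDCDEC"]
table = transition_table(sample)
print("".join(compose(table, "C", 16, random.Random(0))))
```

Every pair of adjoining notes in the output occurs somewhere in the sample, but the tune as a whole need not coincide with any of the sample tunes.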

Similar attempts to obtain new melodies by means of experiments in which 
cards were drawn from an urn were carried out by F. and C. Attneave in rela- 
tion to cowboy songs. Here also only the probabilities of individual notes and 
two-note combinations were considered (i.e., ‘sentences’ of the sort mentioned 
on p. 182 were constructed) and in addition it was required as well that a 
definite rhythm be preserved. The only difference from Pinkerton’s work con- 
sisted of the fact that it turned out to be more convenient to compose the 
cowboy melodies from their ‘end’ by using the computed conditional probabilities 
of the notes preceding some given note. As reported at the London 
Conference on Information Theory, among several tens of ‘random musical 
sentences’ composed by the Attneaves, two were found to be apt, in that they resemble 
genuine cowboy melodies. The comparatively small percentage of success is 
naturally explained by the fact that only the simplest statistical regularities of 
the considered songs were taken into consideration. 

The basic goal of Brooks et al. [80] was the same, namely to compose new 
melodies by means of ‘random experiments’. In the given case only the ‘draws 
of cards from card collections’ were effected automatically by an electronic 
computer; operations of this type are found to be highly fruitful in many cal- 
culations using such computers (the so-called Monte-Carlo methods), and at 
present there exist well-developed methods for their automatic accomplishment. 
The immense potentialities of modern high-speed computers are demonstrated, 
in particular, by the fact that Brooks et al. were able to compose all possible 
‘models of musical sentences’, from ‘first-order approximation sentences’ in 
which only the relative frequencies of the appearance of individual notes (of 
the sort of the ‘English sentence’ mentioned on p. 181) were considered, 
up to ‘eighth-order approximations’ inclusive, in which the frequencies of all 
possible sequences of eight notes were taken into account. For the composition 
of ‘nth order’ sentences (where in different experiments n takes the value 
1, 2, 3, 4, 5, 6, 7 or 8) a definite ‘rhythmic scheme’ was preassigned each time 
(relating to the distribution of durations of notes and rests), and then all notes 
were successively chosen ‘at random’ but in conformity with the computed 
frequencies of the different combinations of n notes. If such a choice 
violated the given ‘rhythmic scheme’, then the corresponding 
note was rejected and the computer automatically repeated the procedure 
of ‘random choice’; if 15 consecutive attempts resulted in ‘rejected notes’, then 
the computer was stopped and the composition of the entire series of notes 
was started afresh. In all nearly 600 ‘new hymns’ were composed in this way 
(out of a total number of attempts of the order of 6,000); the high percentage 
of failures is explained by the fact that for some values of n (in particular, for 
n = 5 and n = 7) it turned out to be very difficult to satisfy the rhythmic 
scheme. Examples of melodies constructed with n = 1, 2, 4, 6 and 8 are listed 
below. For n = 1 and n = 2 the ‘melodies’ constructed contained many odd 
combinations of notes and unnatural intervals; these ‘melodies’ are difficult to 
sing in spite of the presence of a rigid rhythmic scheme. For n = 4 and n = 6 
they tend quite appreciably to sound like ordinary hymns. In the case of n = 8, 
the ‘compositions’ of the computer reduced to nonoriginal compilations: 
the rather lengthy parts of ‘melodies’ obtained coincide completely with fragments 
of one of the hymns and it is just occasionally (in places where two or 
more of the 37 hymns considered have the same groups of 7 notes) that a 
passage from one hymn to the other takes place (in particular, the fragment 
written above is formed of portions of three different hymns; the transition 
places are indicated by the brace appended below it). This position stems from 
the small volume of material utilized for the compilation of the frequency table, 
which naturally led to exceptionally high redundancy.† The fact is that many 
combinations of 8 notes occurred just once in the analyzed fragments of the 
hymns; hence for n = 8 many notes in succession were found to have been 
chosen from a single hymn. 
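
The rejection scheme just described can be sketched as follows. This is an illustration under loudly stated assumptions and not the program of [80]: the symbol "-" for a sound held over from the preceding eighth, the encoding of the rhythmic scheme as a string of "x" (a new note starts) and "-" (the sound is held), the three toy tunes and all function names are hypothetical. Only the limit of 15 consecutive rejections and the restart of the whole composition are taken from the text.

```python
import random
from collections import Counter, defaultdict

HOLD = "-"          # assumed symbol: sound extended from the preceding interval
MAX_REJECTS = 15    # consecutive rejections tolerated before starting afresh

def ngram_table(tunes, n):
    """Map each context of n - 1 elements to the counts of the element that follows."""
    table = defaultdict(Counter)
    for tune in tunes:
        for i in range(len(tune) - n + 1):
            table[tuple(tune[i:i + n - 1])][tune[i + n - 1]] += 1
    return table

def fits(element, slot):
    """A slot of the scheme is 'x' (new note) or '-' (held sound)."""
    return (element == HOLD) == (slot == HOLD)

def try_compose(tables, n, scheme, rng):
    tune = []
    for slot in scheme:
        k = min(n, len(tune) + 1)  # shorter context at the very beginning
        context = tuple(tune[len(tune) - (k - 1):])
        followers = tables[k].get(context)
        if not followers:
            return None
        choices, weights = zip(*followers.items())
        for _ in range(MAX_REJECTS):
            element = rng.choices(choices, weights=weights, k=1)[0]
            if fits(element, slot):
                tune.append(element)
                break
        else:
            return None  # 15 rejected notes in a row: abandon this attempt
    return tune

def compose(tunes, n, scheme, rng, max_attempts=1000):
    """Repeat whole attempts until one satisfies the rhythmic scheme."""
    tables = {k: ngram_table(tunes, k) for k in range(1, n + 1)}
    for _ in range(max_attempts):
        tune = try_compose(tables, n, scheme, rng)
        if tune is not None:
            return tune
    return None

# Hypothetical toy 'hymns' sharing one metric structure.
tunes = ["C-DE-FG-", "E-DC-DE-", "G-FE-DC-"]
result = compose(tunes, 2, "x-xx-xx-", random.Random(1))
print("".join(result) if result else "no tune found")
```

With so small a sample, raising n quickly makes every context unique, so that the output merely copies one of the tunes; this mirrors the behaviour reported above for n = 8.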

Endeavours in an allied direction are also described by Olson and Belar [139], 
who make use of an analysis of the frequencies of individual notes, their 
pairs and triples in Foster’s songs in order to evolve a special ‘computer-com- 
poser’ to compose (and then even play) quite simple musical compositions, 
reminiscent of Foster’s melodies. Lately, experiments on the computer composition 
of artificial musical melodies by using the appropriate statistical analysis 
data have received a great impetus in a number of countries. For example, 
in the USA ‘computer created’ melodies are regularly played over radio and put 
on records or tapes, which are offered for sale. However, we shall not dwell here 
upon the indicated experiments, which are only indirectly related to a study of 
the information-theoretic characteristics of musical texts, but refer the inter- 
ested reader to Pierce [17, Chap. 13] and Zaripov [177], who have considered 
all these experiments in great detail. 


4.3.4. Transmission of continuously varying messages. Television images 


Before we proceed further, we shall emphasize a fact that is of great import- 
ance for both theoretical and practical information transmission through com- 
munication channels. It is clear that spoken language or music differs principally 
from written language in the following respect: here the ‘possible messages’ 
are not sequences of symbols (‘letters’), which can take a finite number of values, 
but are collections of sound vibrations, which can vary in a continuous manner. 
Hence, strictly speaking, it is necessary to consider that each sound can have 
infinitely many ‘values’; but in that case all the formulas of our book become 
inapplicable. In the foregoing, we circumvented this difficulty by resorting to a 
decomposition of all sounds of the spoken language into a finite number of 
phonemes, and all musical sounds into a finite number of notes. But, is this 
legitimate? 

For an answer to this question it is necessary that the decomposition invoked 
be understood in the proper sense. The point is that if we are 
interested in just the ‘semantic’ information contained in speech, then we need 
not take notice of every variation of the speech sound if it does not obstruct 


†Note that in any fragment, in which no N adjacent notes (or letters, or phonemes) are 
repeated, the entropy $H_N$ is zero, i.e., the redundancy calculated with respect to $H_N$ is unity. 
Hence reliable determination of the conditional entropy $H_N$ for large N involves the use of a 
vast amount of statistical material. 
our understanding of what is said and does not alter its meaning. Hence we 
can fully combine a majority of sounds that are similar among themselves if 
only the replacement of one of them by the other does not alter the meaning 
of what has been said. But a phoneme is also actually precisely such a collec- 
tion of sounds close to each other and having the same meaning value (con- 
versely, in speech the replacement of one phoneme by another can alter the 
meaning of a word; this property often forms the basis of the definition of a 
phoneme). Hence, clearly, when considering the problem of the ‘semantic’ information 
contained in speech, we ought to consider that the ‘basic elements’ of 
speech are not all sounds that are different among themselves (whose number 
is obviously infinite), but only a few ‘intelligible sounds’ having different mean- 
ings, i.e., phonemes. Exactly so is the case of music; if we are interested in 
just the information contained in the performed composition, but not in the 
interpretation of the composition by a certain performer, then it is necessary 
to identify all sounds that are expressed by the same sequence of note symbols, 
i.e., to consider only a finite number of different ‘basic sounds’ corresponding 
to a finite number of existing notes. 

But one can pose an even broader problem. In particular, in the case of 
speech, besides ‘semantic’ information one can consider also the information 
contained in the intonation as well as the tone of the voice, and in the case of 
music one can be especially interested in the peculiarities of a given individual 
performance (the transmission of these peculiarities is a very important problem 
in communication engineering). The question is whether it is necessary in this 
case to consider that every sound can take an infinite set of values and hence 
can have an infinite entropy. A negative answer to this question has already 
been given on pp. 218-219, where we have deduced an evaluation of the entropy 
of spoken language with regard to different forms of ‘unsemantic’ information. 
We shall now undertake a more elaborate discussion to clarify this fact. 

It is certainly true that the loudness of sound or the pitch of tone can be varied 
continuously, i.e., can take an infinite number of different values; moreover, in 
principle these values can replace each other as quickly as desired. However, 
our ear can distinguish only sounds that do not occur in extremely rapid succes- 
sion; hence it can be considered that all sounds that we hear have a definite 
minimum duration. Moreover, we can discern only such sounds as differ in 
loudness and pitch by not less than a certain definite finite value, and 
we cannot grasp a sound that is too high, or too low, or too soft, or too loud 
(loud sounds deafen us). Hence it follows that in fact only a finite number of 
scales of loudness and pitch of tone are distinguishable. By identifying on this 
basis all sounds, whose loudness and pitch of tone are determined to be within 
the range of one scale, we again arrive at our familiar sequence of signals, 
which can take only a finite number of different values. 

The extremely general situation considered here is quite similar to the one 
we were confronted with in the solution of Problem 22 in Section 2.3 (p. 80). 
There also we encountered the case of an experiment $\beta$ having an infinite number 
of possible outcomes; however, it was found that for solving the problem 
experiment $\beta$ can be replaced by a new experiment $\beta_\varepsilon$, obtained from $\beta$ by identifying 
all its outcomes which differ from each other by less than a small number $\varepsilon$. 
The entropy $H_\varepsilon$ of $\beta_\varepsilon$ (in contrast to the entropy of experiment $\beta$ itself, $H_\varepsilon$ 
is a finite quantity) is called the $\varepsilon$-entropy of $\beta$. In all problems concerning 
the transmission of messages, which are represented by continuously varying 
quantities, the $\varepsilon$-entropy occupies a very important place. In the transmission 
of such messages, a collection of all possible values of the signals to be trans- 
mitted is always partitioned into a finite number of scales (‘cells’ in the space 
of values) and all values within the range of one scale are identified among 
themselves (for instance, they are considered to coincide with the ‘centre’ of the 
corresponding cells). This operation of replacing a continuous message by a 
new message that takes only a finite number of possible values is called in 
communication engineering the quantization of a message. A quantized message 
always has a finite entropy (representing one of the variants of the $\varepsilon$-entropy 
of the original continuous message) that depends on the choice of the quantization 
method applied, but characterizes also the degree of uncertainty of the 
original continuous message. The latter circumstance justifies the use of 
the corresponding quantities in communication engineering. 
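
Quantization and the dependence of the resulting entropy on the cell size can be illustrated by a small numerical sketch. The uniformly distributed ‘message’, the cell widths and the function names are assumptions chosen only for illustration; each value is replaced by the centre of the cell of width eps that contains it, as described above.

```python
import random
from collections import Counter
from math import floor, log2

def quantize(value, eps):
    """Replace a value by the centre of the cell of width eps containing it."""
    return (floor(value / eps) + 0.5) * eps

def entropy(values):
    """Entropy (bits) of the empirical distribution of a list of values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * log2(c / total) for c in counts.values())

rng = random.Random(0)
# Hypothetical continuous message: 10,000 values spread uniformly over [0, 1).
signal = [rng.random() for _ in range(10_000)]
for eps in (0.5, 0.25, 0.125):
    quantized = [quantize(x, eps) for x in signal]
    print(eps, round(entropy(quantized), 3))
```

For this uniform message the entropy of the quantized message stays finite for every eps and grows by about one bit each time the cell width is halved, so that it is a characteristic of the message only together with the chosen quantization.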

An important class of such continuously varying messages is the images transmitted 
through television or phototelegraphic communication channels. It is 
easy to see that in principle we have here the same situation as in the 
case of sound transmission: our eye is capable of distinguishing only a finite 
number of brightness grades of pictures and only those elements that are not 
too close to each other. Hence any picture can be transmitted ‘through points’, 
each of which is a signal taking only a finite number of values. In the case of 
phototelegraphy, in many cases we can consider that each ‘elementary signal’ 
(i.e., the smallest distinguishable element of a picture, the point) takes only 
one of two values, namely ‘white’ or ‘black’; in black-and-white television it is 
necessary to take account of a considerable number (several tens) of grades of 
darkening (‘brightness levels’) for every element. In addition, phototelegraphic 
images are stationary, but on a television screen 25 still pictures are shown 
every second one after the other to create an effect of ‘motion’. In both cases, 
however, what is actually transmitted over a communication channel is not the 
outcome of the experiment $\alpha$, which consists of determining the value 
of a continuously varying image hue or brightness (varying from point to point, 
and in the case of television, varying in time also), 
but rather the outcome of an altogether different 
‘quantized’ experiment $\alpha_\varepsilon$, which consists of determining the colour (black or 
white) or luminosity scale for a finite number of ‘points’. This new experiment 
$\alpha_\varepsilon$ can have only a finite number of outcomes, and we can measure its entropy 
$H$ (which is essentially a variant of the $\varepsilon$-entropy of the original experiment $\alpha$). 
The total number of elements (‘points’), into which a picture has to be decomposed, 
is determined in the first place by the so-called ‘resolving power’ of the 
eye, i.e., its capacity to distinguish similar sections of a picture. In modern 
television, this number is usually of the order of several hundred thousands 
(in Russian telecommunication, a picture is decomposed into 400,000-500,000 
elements, in American into approximately 200,000-300,000, in the transmission 
at certain French and Belgian television-centers into almost 1,000,000). 
It is easy to understand that for this reason the entropies of television images 
have vast magnitude. Thus, even if it is assumed that the human eye differentiates 
only 16 different ‘brightness levels’ (the value is evidently too low) and 
that a picture decomposes into altogether 200,000 elements, then the ‘zero-order 
entropy’ is found here to be $H_0 = \log 16^{200,000} = 800,000$ bits. The value 
of the true entropy $H$ is obviously less, since a television picture has the significant 
redundancy $R = 1 - (H/H_0)$. Indeed, while calculating $H_0$ it has been 
assumed that the values of brightness at any pair of ‘points’ of a picture are 
independent of each other, whereas in fact the brightness usually varies very 
little in the passage to the adjacent elements of the same picture (or even a 
different but closely following one). The descriptive meaning of the redundancy 
$R$ is that, among our $16^{200,000}$ possible combinations of brightness values 
at all points of a screen, the sensible combinations, which can be called ‘pictures’, 
form only a negligibly small part. An overwhelming majority of these combinations 
make up a completely disordered collection of points of different brightness, 
far removed from the ‘subject’ whatever it may be. On the other hand, 
the real ‘degree of uncertainty’ $H$ of a television picture should obviously take 
note of only those combinations of brightness values that have at least some 
chance of being transmitted, and not all general combinations of brightness 
values.† 
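
The arithmetic of this paragraph, and of the footnote attached to it, is easily checked. The sketch below is only a verification aid; the variable names are introduced here for illustration, and the footnote's numbers are handled through exponents of 10, since the quantities themselves are far too large to be formed directly.

```python
from math import log2

# Zero-order entropy: 200,000 independent elements with 16 brightness levels each.
elements = 200_000
levels = 16
h0_bits = elements * log2(levels)
print(h0_bits)  # 800000.0

# Footnote example: 10 levels, so there are 10**200000 combinations in all;
# the equally probable 'sensible pictures' form 10**-2000 of them,
# i.e. there are 10**198000 such pictures.
total_exponent = 200_000
sensible_exponent = total_exponent - 2_000
# H / H_0 = log(10**198000) / log(10**200000) = 198000 / 200000
redundancy = 1 - sensible_exponent / total_exponent
print(round(redundancy, 4))  # 0.01
```

The second result shows how slowly the logarithm grows: even a fantastically small share of ‘sensible’ combinations lowers the entropy by only 1%.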

The determination of an exact value for the entropy H (or redundancy R) 
of a television picture demands a penetrating study of the statistical relations 
between the brightness of different screen points. This problem is quite involved 
and at present we have just a few relevant particular results. Schreiber [157] 
has measured, in particular, the values of the entropies $H_0$, $H_1$, $H_2$ and $H_3$ for a 


†One should not think that the extreme scarcity of ‘sensible pictures’ automatically 
implies that the redundancy $R$ is necessarily quite large. In fact, by assuming, say, that the 
human eye differentiates in all 10 different scales of brightness (so that the total number of possible 
brightness combinations is $10^{200,000}$) and that the ‘sensible pictures’ (which for simplicity 
are considered to be equally probable) form in all 0.00...01% (with 1997 zeros after the decimal 
point!) of all possible brightness combinations, it is easy to find that the redundancy 
$R$ is close to 1 - [(200,000 - 2,000)/200,000] = 0.01 = 1%, i.e., it is extremely small (if the 
number of distinctive brightness scales were increased, then it would be still smaller). This 
apparently unexpected result is explained by the extremely slow variation of the function log n 
for large values of n, already mentioned on p. 208 (in connection with the evaluation of hieroglyphic 
writing) and on p. 218 (in relation to the estimation of ‘unsemantic’ information of 
speech). 
number of television subjects of varying complexity, but he published the results 
for only two of them, of which the first (picture A, representing a landscape with 
trees and architectural structures) is the most complicated, and the second (picture 
B, representing a rather dark gallery with passers-by) is the most monochromatic 
in colour and contains the least detail. Schreiber distinguished 64 different 
brightness levels of an element of a television picture; hence the entropy $H_0$ (related 
to one element, but not to the whole image) is found to be $H_0 = \log 64 = 6$ bits. 
Furthermore, with the aid of a special engineering device he calculated 
for both the considered pictures the relative frequencies (probabilities) $p_1, p_2, \ldots, p_{64}$ 
of all differentiable brightness levels and defined the corresponding ‘first-order 
entropy’ by 

    $$H_1 = H(\alpha_1) = -p_1 \log p_1 - p_2 \log p_2 - \ldots - p_{64} \log p_{64}$$

(note that a direct calculation of $p_1, p_2, \ldots, p_{64}$ can hardly be accomplished 
without the mediation of radio engineering when the total number of screen 
elements is of an order 200,000). The same engineering device was then applied 
for calculating the relative frequencies $p_{ij}$ of the adjacent (horizontal) pairs of 
elements, in which the first (second) element has the ith (jth) brightness value, 
as well as the relative frequencies $p_{ijk}$ of the adjacent (here also only horizontal) 
triples of elements, in which the first, second and third elements have, respectively, 
the ith, jth and kth brightness value (the numbers i, j and k run through all 
values from 1 to 64). These frequencies enabled him to determine the ‘entropies 
of compound experiments’: 

    $$H(\alpha_1\alpha_2) = -p_{11} \log p_{11} - p_{12} \log p_{12} - \ldots - p_{64,64} \log p_{64,64}$$

and 

    $$H(\alpha_1\alpha_2\alpha_3) = -p_{111} \log p_{111} - \ldots - p_{64,64,64} \log p_{64,64,64},$$

and then also the conditional entropies: 

    $$H_2 = H_{\alpha_1}(\alpha_2) = H(\alpha_1\alpha_2) - H(\alpha_1)$$

and 

    $$H_3 = H_{\alpha_1\alpha_2}(\alpha_3) = H(\alpha_1\alpha_2\alpha_3) - H(\alpha_1\alpha_2),$$

though $H_3$ was calculated only for picture B. The results obtained are tabulated 
below: 

                 $H_0$    $H_1$    $H_2$    $H_3$
    Picture A    6        5.7      3.4      -
    Picture B    6        4.3      1.9      1.5
From the table it is seen that the entropy $H_1$ does not differ very much from 
$H_0$, it being appreciably larger for picture A than for picture B (this is 
obviously due to B being more monochromatic than A). The 
conditional entropy $H_2$ (i.e., the average ‘degree of uncertainty’ of the brightness 
of a screen element when the brightness of the adjacent horizontal element is 
known) differs substantially from $H_0$; this also is markedly lower for B than 
for A, which corresponds to the lesser abundance of detail in B. The redundancy 
$R$ estimated with respect to $H_2$ [i.e., the difference $1 - (H_2/H_0)$] is 
44% for A and 68% for B; the real value of the redundancy can only be larger 
than this. As to the conditional entropy $H_3$, when the brightness of two preceding 
elements of the same line is known, it differs comparatively little from $H_2$ 
(its corresponding redundancy value for B is 75%); hence we can conclude that 
by knowing the brightness of the closest elements we determine a very considerable 
part of the total redundancy. 
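
A calculation of this kind can be sketched for a picture stored as rows of integer brightness levels. The code below is an illustrative reconstruction, not a description of Schreiber's device: the function names and the tiny test picture are assumptions, and each conditional entropy is estimated from the horizontal blocks of n adjacent elements and the contexts formed by their first n - 1 elements, which corresponds to the differences of entropies of compound experiments used above.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Entropy (bits) of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def horizontal_blocks(image, n):
    """All groups of n horizontally adjacent elements, taken row by row."""
    return [tuple(row[i:i + n]) for row in image for i in range(len(row) - n + 1)]

def conditional_entropy(image, n):
    """H_n: uncertainty of an element given the n - 1 preceding ones on its line."""
    blocks = horizontal_blocks(image, n)
    joint = Counter(blocks)
    context = Counter(block[:-1] for block in blocks)
    return entropy(joint) - entropy(context)

# Hypothetical 4 x 8 picture with 4 brightness levels (so H_0 = 2 bits).
image = [
    [0, 0, 1, 1, 2, 2, 3, 3],
    [0, 1, 1, 2, 2, 3, 3, 3],
    [0, 0, 0, 1, 1, 2, 2, 3],
    [1, 1, 2, 2, 3, 3, 3, 3],
]
h0 = log2(4)
for n in (1, 2, 3):
    h = conditional_entropy(image, n)
    print(n, round(h, 3), "redundancy:", round(1 - h / h0, 3))
```

Applied to the tabulated values for picture B, the same formula for the redundancy gives 1 - 1.9/6, or about 68%, with respect to $H_2$ and 1 - 1.5/6 = 75% with respect to $H_3$.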

The works of Lebedev and Piil [121] (see also [122]) and Limb [123] are also 
of a similar nature. In [121] and [122] some results are deduced from the calcu- 
lations that are based on the use of statistical material slightly poorer than in 
[157] and a division of possible values of the brightness of an element of a television 
picture into 8 rather than 64 scales. These results include the evaluation 
of the entropies $H_0$ and $H_1$ and a number of conditional entropies $H_2$, $H_3$ and 
$H_4$ of a single element of the image for the following four television sport 
features: (A) fast running basketball players; (B) close-up of a spectator 
in the grandstand; (C) a panoramic view of spectators in the grandstand; and 
(D) fast running football players. Let us denote by the digits 1 and 2 the image 
elements adjacent to the given element 0 in the horizontal and vertical direction, 
by 3 the adjacent diagonal element, by 4 the same element as the given one but 
considered in the preceding television transmission frame, by 5 an element at 
the same line adjacent to element 1 and, finally, by 6 the same element in the 
frame preceding the one which contains element 4 (see Fig. 16a). We indicate in 
a parenthesized superscript of the conditional entropy notation the numbers of the 
image elements whose brightness level is assumed to be known. In such a case, following [121] 
(see also [122]) the values of various entropies (in bits) can be presented in the 
form of the accompanying table. 

[Fig. 16. Numbering of the image elements adjacent to the given element 0: panels (a) and (b).]

           $H_0$   $H_1$   $H_2^{(1)}$   $H_2^{(2)}$   $H_2^{(4)}$   $H_2^{(?)}$
    (A)    3       1.96    0.69          0.98          -             1.77
    (B)    3       1.95    0.36          0.39          -             -
    (C)    3       2.78    1.34          1.95          2.78          -
    (D)    3       2.45    -             -             2.00          2.08

           $H_3^{(1,5)}$   $H_3^{(?)}$   $H_3^{(1,2)}$   $H_4^{(1,2,3)}$   $H_4^{(1,2,4)}$
    (A)    0.68            -             0.56            -                 -
    (B)    0.35            -             0.27            0.26              -
    (C)    -               -             1.22            1.18              1.19
    (D)    -               1.83          -               -                 -


(a dash in this table signifies that the corresponding entropy has not been com- 
puted). The following four parts (each containing 5,000 individual elements) of 
two television images are analyzed in [123]: (A) the earth surface of an average 
setting covered with grass and bushes; (B) a part of the same landscape, adjacent 
and similar to (A); (C) a part of sky with clouds of comparatively uniform light 
hue; (D) close-up of a large grassy area with bushes. The image elements are 
divided into 16 brightness levels; for calculating the conditional entropies of an 
image element with number 0, the data related to the elements 1, 2, 3, 4 and 5 
of the same and preceding lines of the same frame (see Fig. 16b) are used. 
The results obtained in [123] are listed in the accompanying table. 

                             $H_0$   $H_1$   $H_2^{(?)}$   $H_2^{(?)}$   $H_3^{(?)}$   $H_3^{(?)}$   $H_4^{(?)}$   $H_4^{(?)}$
    (A)                      4       2.85    2.24          2.38          1.82          2.10          1.46          1.47
    (B)                      4       2.51    1.99          1.96          1.66          1.66          1.15          1.28
    (C)                      4       1.32    1.04          0.99          0.94          0.97          0.90          0.92
    (D)                      4       3.72    2.70          3.10          2.01          2.23          0.87          0.86
    (A) and (B)              4       2.90    -             2.27          -             2.03          -             1.54
    (C) and (D)              4       3.29    -             2.17          -             1.65          -             0.91
    (A), (B), (C) and (D)    4       3.52    -             2.31          -             2.00          -             1.39
The data contained in [121]-[123] are qualitatively close to the results in 
[157] (a quantitative comparison is difficult here because of the differences in 
the number of quantization levels used, affecting the numerical values of the 
entropies) but are considerably more complete. In particular, Schreiber’s conclusion 
(related to the comparatively monochromatic and detail-starved image B) to 
the effect that when a preceding image element is known, a further knowledge 
of any other elements alters but little the degree of uncertainty (i.e., the entropy) 
of a given element of a television image is in excellent agreement with the data 
related to the monochromatic and detail-starved images of a close-up of a person’s 
face (image (B) of [121], [122]) and a cloudy sky (image (C) of [123]). It may, 
however, be noted that according to the data deduced in [122] the stated con- 
clusion does not fare badly even for all other investigated images (including also 
the most ‘heterochromatic’ image C), while the results of [123] related to the 
images (A), (B) and (D), do not corroborate it. An analysis of Limb’s data 
permits us to infer also that the use of probabilities (i.e., frequencies) calculated 
for a large and quite inhomogeneous image (whose model can be represented by 
the union of the heterogeneous parts of (A), (B), (C) and (D) in two different 
frames) leads to just a small increase in the values of the conditional entropies 
(when the brightness values of one, two or three preceding elements are known) 
in comparison to the values of the conditional entropies calculated for the parts 
of the image taken separately. Furthermore, the results of [121], [122] related 
to the conditional entropies, when brightness values of the same image element 
at one or two preceding frames are known, show that for the rapidly changing 
images under consideration these conditional entropies exceed appreciably the 
conditional entropy, given the brightness of the preceding (along the line) element 
of the same frame. Hence, taking into account the relation between the brightness
values in successive frames of a television transmission cannot here add
appreciably to the redundancy determined from the
analysis of the brightness distribution in one frame. The preceding conclusion
apparently may not be valid for television subjects, for which the image varies 
less over time; however, reliable quantitative data, related to such cases, are still 
lacking (some estimates of time relations, based on indirect arguments, may be 
found in [132]). The total redundancy of television images by the data in [123] 
both in the case of an image rich in detail (a ‘close-up of vegetation’) and in that 
of a detail-starved image (‘sky’) is found to be not less than 80% (but for a
‘medium’ image (A) or (B) it turns out for some obscure reason to be not so high, 
although it is nevertheless not less than 65%). At the same time the results in [121], 
[122] lead to the conclusion that for a detail-starved image (‘of the face of a 
person’) the redundancy is not less than 90%, and for a detail-affluent image (‘of 
many spectators’), it is not less than 60%. Note that the values of redundancy
in [121]-[123], larger than those found by Schreiber [157], can be naturally
explained by a cruder division of the brightness levels. On the other hand, the
divergences in the conclusions due to Lebedev and Limb regarding the differences
of redundancy between ‘monochromatic’ and ‘heterochromatic’ images are related
to the disagreement remarked above in the results of these authors on the
rate of decrease of entropy values in the sequence H_0, H_1, H_2, H_3, H_4 for all
images not too poor in details. (The reasons for this disagreement are not yet 
clear, but on the whole the results in [121], [122] seem to be more plausible than 
those in [123].) 

It is clear that calculations of the sort set forth in [121]-[123] and [157] cannot
be used to determine the relations between many elements that affect the
redundancy of an image. In fact, even in the case of the entropy H_4, the number
of different combinations of brightness values at four points already turns
out to be vast (recall that a comparatively crude division into brightness levels 
is applied in the works [121]—[123]), and with a further increase in the order 
of the conditional entropies, this number increases enormously which makes 
calculation intractable. Hence it is worthwhile to draw attention to a few efforts 
that were made to estimate the image element entropy and redundancy via 
Shannon’s ‘guessing experiment method’ (or some other method which does not 
involve the calculation of the frequencies of groups of many image elements). 
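
The frequency-counting approach discussed above can be sketched as follows. This is only an illustrative sketch, not the procedure of [121]-[123] or [157]: the function names and the toy data are hypothetical, H_1 is taken from single-element frequencies and H_2 is taken as the conditional entropy of an element given its left-hand neighbour. The last helper shows how quickly the number of brightness combinations grows: with 16 levels there are already 16^4 = 65,536 combinations at four points.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of an empirical distribution given as counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def line_entropies(rows):
    """Return (H1, H2) for rows of quantized brightness levels.

    H1 uses single-element frequencies; H2 is the conditional entropy of an
    element given its left neighbour, computed as H(pair) - H(left element).
    """
    singles = Counter(x for row in rows for x in row)
    pairs = Counter(pair for row in rows for pair in zip(row, row[1:]))
    lefts = Counter()
    for (left, _right), count in pairs.items():
        lefts[left] += count
    h1 = entropy(singles.values())
    h2 = entropy(pairs.values()) - entropy(lefts.values())
    return h1, h2


def number_of_contexts(levels, points):
    """Number of distinct brightness combinations at the given number of points."""
    return levels ** points


# Hypothetical toy "image": four lines, each of constant brightness.
toy_rows = [[k] * 8 for k in range(4)]
h1, h2 = line_entropies(toy_rows)
```

In the toy image every element repeats its neighbour, so the conditional entropy vanishes although the four levels are equally frequent. For real images the table of pair (and higher-order) frequencies must be filled from far more data than the number of contexts, which is the practical obstacle described in the text.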

Apparently, the first, still imperfect, attempt in this direction was made by Parks 
[141]. He tried to apply the guessing-experiment method (and also a cruder method
of restoring the entire picture when only a part of it is exposed) to the approxi- 
mate estimation of the redundancy of three very different half-tone (i.e., black- 
and-white) pictures. Of these, the first (a close-up portrait of a sailor) contained 
the least details in comparison to the rest. The second picture (of a girl with a 
flower lying on a rug) ranged between the other two and the third one (a repro- 
duction of an abstract painting) was the most variegated of all. All pictures 
were divided by Parks into about 1,500 square elements, and the average gray- 
shade (i.e., the degree of blackness) was determined for every element. Then 
all gray-shades of different elements were divided into eight levels in case of 
the first and the third pictures and into six levels in case of the second picture. 
Parks covered all pictures with an array of square tiles corresponding to all 
the picture elements and asked a number of subjects (selected from the under- 
graduate university students in fine arts, who were unfamiliar with the picture) 
to perform guesses. Expressly, every subject was asked to remove any tile of his 
choice and then guess the blackness levels of all the remaining picture elements
in any order he desired. After every guess the corresponding tile was removed
and the subject could use the knowledge of the true shade of the element in
guessing about the next tile. Parks does not describe in detail the experiments 
that were performed and gives only the estimates finally arrived at, which are 
apparently rather crude. According to his results, the redundancy is not less 
than 75%, 66% and 40% for the first, second and third pictures, respectively. 

A second, simpler but still cruder, method of redundancy estimation
was based on the following guessing experiment. Beginning with the picture
fully covered with tiles, a constant percentage of tiles was randomly removed
and the subject was asked to describe the picture. Then the experiment was
repeated with a higher percentage of removed tiles until the answer was consi- 
dered to be fully correct. This method may clearly give only the strongly under- 
estimated values of redundancy, which enabled Parks to draw the conclusion 
that the redundancy of the first and second pictures is apparently significantly 
above 75% and 50%, respectively (in case of the third picture this method did 
not work satisfactorily). 

Later Tsannes and his students at Tufts University [168] made a more thorough 
attempt to apply the guessing-experiment method due to Shannon (described on 
p. 188 et seq.) for an evaluation of the conditional entropy of an image element 
with regard to higher order spatial relations between the elements. Tsannes 
chose as original material 20 photographs of a section of lunar landscape, each 
of which was represented in the form of a collection of 50 x 50 = 2,500 indi- 
vidual elements, taking one of eight possible values according as its ‘degree of 
blackness’ (i.e., the levels of blackness). These photographs were further divided 
into four groups of photographs of a similar nature. One of the photographs 
(together with its numerical form, representing a square table of 2,500 numbers
from 0 to 7) was given to a guessing person (a senior student at the university), 
who studied it attentively. (The ‘familiarity with the picture’ attained in this 
way is obviously very poor in comparison with the knowledge of the structure 
of the mother-tongue inherent in every literate person, which is used in the guess- 
ing experiments related to written text, but this is inevitable). After this study, 
the same person proceeded to guess successively the elements of another photo- 
graph in the same group. In the course of guessing, movement was permitted 
in any direction after each already guessed element; to each conjecture the answer 
‘yes’ or ‘no’ was given, which was considered to contain one bit of information 
(in fact, it often contained considerably less information since both possible 
answers are not at all equally probable). Thus, the average number of questions 
per element of the image provided a rough estimate from the above (i.e., a 
strongly overstated estimate) of the average entropy of one element of the image. 
In the two guessing experiments described in [168], this average estimate turned 
out to be roughly 1.8 bits in one case and 1.3 bits in the second; the authors 
remarked that an expert in lunar landscapes, by dint of his prior practice in 
this field, would probably obtain a remarkably better result (i.e., a lower bound 
on the entropy). In any case, both estimates so obtained are found to be appreciably
lower than the value H_0 = 3 bits; the true entropy H is obviously significantly
lower than these estimates. If, following Shannon’s proposition mentioned
in the footnote on p. 188, use is made only of the result of the more
successful of the two guessing persons, then the corresponding lower bound on 
the redundancy of the lunar surface image comes close to 60%. 
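
The arithmetic behind this estimate can be illustrated by a minimal sketch. The list of question counts below is hypothetical (the raw data of [168] are not given here); it is chosen only so that its average is 1.3 questions per element. As in the text, each ‘yes’ or ‘no’ answer is counted as one full bit, so the average is an overstated estimate of the entropy and the resulting redundancy is a lower bound.

```python
import math


def guessing_bounds(questions_per_element, levels):
    """Return (upper estimate of entropy in bits, lower bound on redundancy).

    questions_per_element: number of yes/no questions needed for each element.
    levels: number of possible values of one element (8 in the lunar photographs).
    """
    h0 = math.log2(levels)                       # maximum entropy H_0
    average = sum(questions_per_element) / len(questions_per_element)
    return average, 1 - average / h0


# Hypothetical record of a short guessing session (average 1.3 questions).
questions = [1, 2, 1, 1, 2, 1, 1, 1, 2, 1]
h_upper, r_lower = guessing_bounds(questions, levels=8)
```

With H_0 = log 8 = 3 bits and an average of 1.3 questions per element, the bound is R > 1 - 1.3/3, i.e., about 57%, which is the figure ‘close to 60%’ quoted above.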

The appearance of colour television has also given rise to the need to esti- 
mate the information contained in the colour of the image. By way of a rough 
guide, the pioneering calculations in this direction have shown that for colour
television images matching in quality the colour illustrations in magazines, the
information is, in order of magnitude, about double the information contained
in the corresponding black-and-white images (see [132]).


4.3.5. Phototelegrams 


Let us now take up the data concerning a phototelegram. Here, the general 
principle of image transmission is close to the telecommunication principle: the 
image is split into smallest squares (‘screened elements’), after which the infor- 
mation on the colour of each such element (whether it be black or white) is 
transmitted over the channels. Thus, compared to black-and-white television 
the images now considered are simpler: for them there are no brightness grades 
(i.e., degree of blackness) and the colour can take just two values. It is natural 
that the maximum information (i.e., the entropy H_0) contained in the knowledge
of colour per element equals H_0 = log 2 = 1 bit; this information is attained
when black and white elements occur with the same frequency and the colour of 
each element is independent of that of all the rest. In reality, the two colours 
usually occur with different frequencies (the number of white elements as a rule 
considerably exceeds the number of black ones) and between the colours of in- 
dividual elements there is a noticeable dependence; hence the true value of the 
entropy of one element of a phototelegram is appreciably less than 1 bit. The 
task, therefore, is to determine its value. 

It can be calculated that, in the transmission of printed text from an ordinary
book or magazine by phototelegram, the relative frequency p_0 of white
elements is close to 0.8, and the frequency p_1 of black elements is close to 0.2.
Hence the entropy H_1 is given by


    H_1 = -0.2 \log 0.2 - 0.8 \log 0.8 \approx 0.72 bit,


which corresponds to the redundancy R = 1 - (0.72/1) = 0.28 = 28%. However,
this value of redundancy is grossly understated since it takes no note of
the dependence between the colours of adjoining elements. Unfortunately, a precise
quantitative estimation of this dependence (stretching to a large number of
adjoining elements) is highly involved; hence even approximate methods are of
interest for evaluating the entropy H_∞ and redundancy R.
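
The first-order calculation above is easily checked numerically. The sketch below assumes only the frequencies quoted in the text (0.8 for white and 0.2 for black); the function name is our own.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-valued element whose values occur with
    probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


h1 = binary_entropy(0.2)      # first-order entropy of one screen element
redundancy = 1 - h1 / 1.0     # H_0 = 1 bit for a two-colour element
```

This gives h1 of about 0.72 bit and a redundancy of about 28%, in agreement with the figures in the text; the dependence between adjoining elements is deliberately ignored here.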

One of the earliest attempts, quite a sketchy one, to estimate the entropy 
H_∞ = H of a phototelegraphic message can be traced to the work of Deutsch [89].
In this work he analyzed a small fragment of an English text (a few lines long) 
printed in comparatively large letters. Unfortunately, a text written on paper 
is not in the least easy to divide directly into very small ‘screen elements’ em- 
ployed in a phototelegram and in case of such division the given fragment turns 
out to consist of a vast number of elements, which make the arithmetic calcula- 
tion of the frequencies of different combinations exceptionally tedious. Hence, 
Deutsch resorted to a partition of the given text into comparatively larger
squares, each consisting of many screen elements. He classified such squares as
white or black according to which colour predominated in a square (e.g., if more
than 50% of the area of the square is found to be white, then the whole square
is considered to be white; otherwise, black). It is natural that in such a case
H_0 = log 2 = 1 bit for a ‘square’, just as for a screen element. Furthermore,
Deutsch calculated the conditional entropies H_1, H_2 and H_3 for vertical ‘blocks’
consisting of several adjoining squares (for horizontal ‘blocks’ only the entropy 
Hi, was calculated, which turned out to be slightly larger than the corresponding 
value for the vertical ‘blocks’). The entropy H, was found to be 0.67 bit, which 
conformed to the redundancy R being 33%; the entropy H, had already a value 
of 0.57 bit, i.e., it corresponded to the redundancy R = 43%.† By means of
some indirect arguments, it was also shown in [89] that the entropy of one 
‘square’ must in fact be considerably less than 0.5 bit, so that here the redun- 
dancy R does significantly exceed 50%. Note, however, that all these figures do 
not merit any particularly great reliability, since the partition of the text em- 
ployed in [89] into comparatively large squares distorts considerably its statist- 
ical structure. 

†For vertical blocks the entropy H^(N) of a block of N adjoining elements was calculated
for N = 1, 2, 3 and 7. It is interesting that the ratio H^(N)/N for N = 7 was found equal
to as much as 0.58 bit, i.e., slightly larger even than the value of 0.57 bit quoted above.
This fact clearly shows to what extent the sequence of quantities h_N = H^(N)/N,
N = 1, 2, 3, ..., tends more slowly to H_∞ than the sequence H_N (see footnote on p. 184).

The German scientist Kayser [112] carried out a study of this sort in considerably
greater depth. He decomposed the typewritten text into much smaller
squares with a side of 0.2 mm (one typed page here was found to have been
divided into roughly a million individual elements). In order to make the cal- 
culations possible with so large a statistical ensemble, Kayser constructed a 
special measuring apparatus to separate automatically the succeeding ‘blocks’ 
of a small number of N adjoining elements and register the number of blocks of 
distinct composition. This apparatus was then applied to blocks in different 
directions (horizontal, vertical and positioned at an angle to the typed text), and 
all enumeration results were found to vary only slightly with changes of direc- 
tion. Starting from here, Kayser confined the data analysis mainly to horizontal 
blocks. In relation to such blocks he investigated the dependence of the specific
entropies h_N = H^(N)/N, with N = 1, 2, 3, 4, 5 and 6, on the following factors:
(a) the extent of ‘boldness’ (i.e., the thickness of the letters) of the text,
(b) the distance between the lines, and (c) the size of the typescript (i.e., the 
degree of magnification of the typescript copy). The results obtained by him 
with respect to the text of standard ‘boldness’ and size and five different distances 
between the lines (from the densest typescript ‘through a single space’ up to the 
least dense typescript ‘through three spaces’) are shown in Fig. 17. From this 
it is seen that the redundancy of the most densely typed text (though normal
in all other respects) certainly exceeds 50%, whereas for the least dense typescript
it is not less than 80% (but, apparently, these figures are sharply understated,
since h_6 is a very rough estimate of the quantity H_∞). In the case of
thinly typed text all entropies naturally turn out to be smaller, and the redundancies
larger, than those for a standard text; in particular, the value of h_1 = H_1
is appreciably reduced; however, for increasing N the values of h_N for thin print
tend progressively to the values for ordinary print. Conversely, all entropies 
for very ‘heavily’ set text are found to be larger than those for normal text, 
the greatest difference being again observed for N = 1, and the smallest for 
N = 6. For homothetic magnification of a typed text the values of h_1 = H_1 are
not affected (since the fractions of white and black elements are not altered),
but in this case the statistical relations between the adjoining elements increase,
and hence all entropies h_N with N > 1 decrease and the redundancies increase.
In relation to the values of h_N with N > 6, only some quite rough estimates are
deduced in [112], according to which, say, for a single-spaced standard typed
text, h_∞ = 0.40-0.45 bit.

It is clear that the quantities h_N for small N by no means characterize the
complete redundancy of a typed text brought about by all statistical relations 
existing in such text. This is seen, in particular, from the fact that by applying 
quite a different method Kayser obtained results that differ sharply from those 
described above. The measuring instrument he constructed surely could not 
properly recognize the fact that all black elements in its field of vision are por- 
tions of 26 letters of a well-defined form. Hence Kayser did further work to 
determine what is the smallest segment of a square closely covering a letter, by 
the sight of which a literate person is able to guess what letter it is. The
experiments undertaken with this objective showed that if for each letter its
most characteristic part is selected, then it suffices to show only about 15% of 
the area of the square. Hence it can be inferred that the redundancy of a two- 
dimensional figure of an individual letter (and, consequently, also of a very 
closely typed text) on an average is close to 85% (the blank spaces between the 
letters, words and lines in a printed text in general can be considered as entirely 
redundant, however). In addition, we must note that only a part of an isolated 
letter was shown; however, if the entire text preceding this letter were known 
in advance, then quite often the letter can be guessed correctly even without 
looking at any part of it. Hence it is clear that the size of the part needed to guess 
one text letter would be on an average appreciably less than 15%. Starting from 
the data of [118], which is mentioned on p. 195, Kayser concluded that knowledge
of the preceding letters of a typed German text must decrease the limiting
amount of uncertainty (i.e., H_∞) by approximately a factor of three. Hence he
arrived at the result that the true redundancy of a closely typed text is obviously 
close to 95%. This redundancy estimate evidently makes allowance for the highly 
complex statistical relations covering simultaneously many ‘screen elements’, 
generated by both the ways of letter writing and grammar and structure of the 
language; their employment in phototelegraphic engineering is, of course, still 
a remote possibility. 

In the following we shall no longer take note of the semantic and grammatical 
properties of phototelegraphic texts, and instead consider only the statistical 
regularities in the mere interchange of black and white screen elements. In this 
case a fairly good estimate of the entropy H of one screen element can be ob- 
tained by representing each line of a phototelegram in the form of a sequence 
of alternating white and black sections of different lengths. By calculating the 
relative frequencies of the appearance of all such sections the corresponding 
‘first-order entropy’ H_1^(section) can be calculated; here the ratio H_1^(section)/w, where
w is the average number of elements in one section, is surely greater than the
true value of the entropy H of one element (see the discussion on p. 187). By
means of this method, Michel [130] showed that in the transmission of a densely 
typed (‘single spaced’) text, in large type, the entropy H is smaller than 0.3 bit, 
i.e., the redundancy R exceeds 70%; a similar conclusion is also obtained in [112] 
by using the same method. A more detailed investigation of this sort has been 
carried out by Garmash and Kirillov [103] on the basis of quite extensive statist- 
ical material for Russian printed book or magazine text. These authors calculat- 
ed not only the frequencies of monochromatic sections of various lengths, 
but also the frequencies of all possible pairs of such sections, and determined
from these data the first-order section entropy H_1^(section) and the second-order
section entropy H_2^(section). By calculating the ratio H_1^(section)/w, they determined that in
the transmission of printed text H < 0.33 bit, i.e., R > 67%; the inequality
H < H_2^(section)/w allowed them to refine this estimate further and show that
H < 0.28 bit and, correspondingly, R > 1 - 0.28 = 72%.
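
The section method described in this paragraph can be sketched as follows. This is only an illustration under our own assumptions: a line is given as a sequence of 0 (white) and 1 (black), a ‘section’ is identified by its colour and length, and the toy line is hypothetical. On a finite sample the result is, of course, only an empirical estimate of the ratio of the first-order section entropy to the mean section length w.

```python
import math
from collections import Counter
from itertools import groupby


def section_entropy_per_element(line):
    """Return (first-order section entropy, mean section length w, their ratio).

    A section is a maximal run of equal elements, described by (colour, length).
    """
    sections = [(colour, len(list(run))) for colour, run in groupby(line)]
    counts = Counter(sections)
    total = len(sections)
    h_section = -sum(c / total * math.log2(c / total) for c in counts.values())
    w = len(line) / total
    return h_section, w, h_section / w


# Hypothetical scan line: six white elements followed by two black, four times.
line = ([0] * 6 + [1] * 2) * 4
h_section, w, h_per_element = section_entropy_per_element(line)
```

For the toy line there are only two kinds of section, equally frequent, so the section entropy is 1 bit, the mean section length is 4 elements, and the estimate per element is 0.25 bit. Taking pairs of successive sections into account, as Garmash and Kirillov did, would require a second counter over adjacent sections and is omitted here.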

Another method of estimating the entropy H and the redundancy R for a 
phototelegram is due to Vasiliev [171] and Frolushkin [99]. It is clear that an 
exact calculation of the entropy H^(N) of an experiment consisting of determining
the colours of N successive screen elements is, for large N, highly involved
because the total number 2^N of outcomes of this experiment is
extremely large. Hence we divide the corresponding 2^N outcomes into some n
groups containing respectively M_1, M_2, ..., M_n outcomes (where M_1 + M_2
+ ... + M_n = 2^N) and we determine only the probabilities q_1, q_2, ..., q_n of
the successive N elements belonging to the 1st, 2nd, ..., nth group. Assume
now that within each group all outcomes are equally probable (the nonfulfilment
of this restriction can only decrease the entropy H^(N)), and determine the value
of H^(N) subject to this assumption. In this case, the outcomes belonging to the
ith group (where i can take the values 1, 2, ..., n) contribute M_i identical
terms -(q_i/M_i) log (q_i/M_i) to the expression for H^(N). This implies that


    H^{(N)} \le -q_1 \log \frac{q_1}{M_1} - q_2 \log \frac{q_2}{M_2} - \dots - q_n \log \frac{q_n}{M_n}    (*)


(the use of the ≤ sign is connected to the fact that our calculations yield in
general an exaggerated value of H^(N)). Similarly, by assuming that within the
ith group one of the outcomes carries the whole probability q_i of the group and
all the rest have probability 0, i.e., they are impossible (the nonfulfilment of
this restriction can only increase the entropy H^(N)), we obtain


    H^{(N)} \ge -q_1 \log q_1 - q_2 \log q_2 - \dots - q_n \log q_n.    (**)
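
The two bounds (*) and (**) are straightforward to evaluate once the group probabilities and group sizes are known. The following sketch uses a hypothetical distribution over the four outcomes of a block of N = 2 elements, purely to check that the true entropy falls between the bounds; the function names are our own.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def group_bounds(q, sizes):
    """Return (lower, upper) bounds on the block entropy.

    q[i] is the probability of the i-th group, sizes[i] the number M_i of
    outcomes in it. The upper bound assumes equal probabilities within each
    group (formula (*)); the lower bound assumes that a single outcome carries
    the whole probability of its group (formula (**)).
    """
    upper = -sum(qi * math.log2(qi / mi) for qi, mi in zip(q, sizes) if qi > 0)
    lower = entropy(q)
    return lower, upper


# Hypothetical probabilities of the outcomes 00, 01, 10, 11 (0 = white).
outcome_probabilities = [0.7, 0.1, 0.15, 0.05]
true_entropy = entropy(outcome_probabilities)

# Two groups: the all-white block alone, and the three remaining outcomes.
lower, upper = group_bounds(q=[0.7, 0.3], sizes=[1, 3])
```

The gap between the two bounds equals the sum of the terms q_i log M_i, so the estimate is sharp only when most of the probability lies in small groups.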


Vasiliev [171] started from the fact that in the transmission of printed text a 
quite significant part of the redundancy is related to the high frequencies of com- 
paratively long sections of N white elements (which arise because of the presence 
of interline spaces and margins). In agreement with this, his first group of out- 
comes is formed from a single outcome, the one in which all N elements are 
white; the remaining 2^N - 1 outcomes make up the second group. In this
case, formulae (*) and (**) yield


    -q \log q - (1 - q) \log \frac{1 - q}{2^N - 1} \ge H^{(N)} \ge -q \log q - (1 - q) \log (1 - q),


where q is the probability of a ‘white’ block of N screen elements. If it is further
noted that for large N the expression 2^N - 1 is almost the same as 2^N, so
that log (2^N - 1) can be replaced by log 2^N = N, then


    \frac{-q \log q - (1 - q) \log (1 - q)}{N} + (1 - q) \ge h_N \ge \frac{-q \log q - (1 - q) \log (1 - q)}{N},


where h_N = H^(N)/N is the approximate value of the ‘specific entropy’ of one
screen element. In order to obtain a satisfactory estimate of H = H_∞ = lim_{N→∞} h_N
it is necessary to take N of the order of ten or several tens; in this case, q for
newspaper text turns out close to 0.5 (or even more), and for typed text set in the
ordinary way (‘double spaced’) close to 0.7 (or more). It is hence clear that in
the transmission of newspaper text H < (1/10) + 0.5 = 0.6 and R > 1 - 0.6
= 40%; in the transmission of ordinarily typed text


    H < \frac{-0.3 \log 0.3 - 0.7 \log 0.7}{10} + 0.3 \approx 0.39  and  R > 1 - 0.39 = 61%.
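
The two numerical estimates just given can be reproduced with a few lines. The sketch below assumes the approximation log (2^N - 1) ≈ N used in the text, so that the specific entropy h_N lies between h(q)/N and h(q)/N + (1 - q), where h(q) = -q log q - (1 - q) log (1 - q); the function names are our own.

```python
import math


def binary_entropy(q):
    """-q log q - (1 - q) log (1 - q), in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def white_block_bounds(q, n):
    """Bounds on the specific entropy h_N for the two-group partition
    (all-white block versus all other blocks), using log(2^N - 1) ~ N."""
    lower = binary_entropy(q) / n
    upper = binary_entropy(q) / n + (1 - q)
    return lower, upper


_, newspaper_upper = white_block_bounds(q=0.5, n=10)   # newspaper text
_, typed_upper = white_block_bounds(q=0.7, n=10)       # double-spaced typescript
```

The upper bounds come out as 0.6 bit and about 0.39 bit, i.e., redundancies of at least 40% and 61%, as stated in the text.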


The value of such a comparatively rough estimate of the entropy H lies in the 
fact that here it is easy to specify a concrete coding method, which permits the 
transmission to be conducted at the rate 


    v = \frac{C}{H} = \frac{NC}{-q \log q - (1 - q) \log (1 - q) + N(1 - q)}

(per screen element/unit time), where C is the capacity of the communication 
channel being used (see [171]). 
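
The coding method of [171] is not reproduced here. Purely as an illustration of the kind of code that such an estimate suggests, the following hypothetical scheme sends a single flag digit for every block of N elements: 0 if the block is entirely white, and 1 followed by the N elements themselves otherwise. It spends 1/N + (1 - q) binary digits per screen element on average, which for q near 1/2 is close to the estimate above (the flag costs one digit per block rather than -q log q - (1 - q) log (1 - q)).

```python
def encode(bits, n):
    """Encode a sequence of 0 (white) / 1 (black) elements block by block."""
    code = []
    for start in range(0, len(bits), n):
        block = bits[start:start + n]
        if not any(block):
            code.append(0)            # all-white block: flag only
        else:
            code.append(1)            # mixed block: flag, then raw elements
            code.extend(block)
    return code


def decode(code, n, length):
    """Invert encode(); length is the number of elements originally sent."""
    bits = []
    i = 0
    while len(bits) < length:
        take = min(n, length - len(bits))
        if code[i] == 0:
            bits.extend([0] * take)
            i += 1
        else:
            bits.extend(code[i + 1:i + 1 + take])
            i += 1 + take
    return bits


# Hypothetical line of 60 elements: five white blocks and one mixed block.
line = [0] * 30 + [0, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [0] * 20
code = encode(line, n=10)
digits_per_element = len(code) / len(line)
```

Here five of the six blocks are white, so 16 digits suffice for 60 elements, that is, 1/10 + 1/6 digits per element. Over a channel of capacity C the corresponding transmission rate is C divided by this number of digits per element.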

In [99], all blocks of N screen elements are partitioned into a large number 
of groups, characterized by definite values of ‘saturation’ and ‘mesh’. By ‘satur-
ation’ is understood here just the total number of black elements in a block (so 
that for a block of N elements the ‘saturation’ can take N + 1 values: 0, 1, 2, 
..., N), and by ‘mesh’ the number of monochromatic sections into which a 
given block is partitioned (the ‘mesh’ of a block of N elements can equal 1, 2, 
3,... or N, i.e., can have N distinct values). The calculation of the values of 
‘saturation’ and ‘mesh’ of individual blocks was carried out automatically by 
means of a convenient special device constructed by Frolushkin. The value of 
N taken in [99] is 100, i.e., the quantity H^(100) is evaluated and the entropy H
of one element is equated to h_100 = H^(100)/100. In connection with such a choice
of N the measuring circuit is provided with a device, which automatically 
switches the circuit on for the time interval, corresponding to the transmission 
of 100 screen elements of a phototelegram through a channel; after that the 
circuit is switched off, the values of ‘saturation’ and ‘mesh’ are registered and it 
is only after this that another section of the phototelegram is fed into the circuit. 

Phototelegrams with handwritten, typed and printed (newspaper) texts were 
analyzed separately, where in all cases the phototelegrams were filled with the 
densest possible text, as is customary in real transmission. Each of the three 
types of texts was represented by 10 extracts and from every extract 400 distinct 
blocks of 100 elements were selected. From the data obtained the frequencies 
(approximate values of probabilities) of different values of ‘saturation’ and ‘mesh’
were obtained, and also the frequencies of different combinations of the values
of ‘saturation’ and those of ‘mesh’. By calculating further the number M_n^(satur.)
of blocks having the given ‘saturation’ n, the number M_m^(mesh) of different blocks
having a given ‘mesh’ m and, finally, the number M_{n,m} of blocks having simultaneously
the ‘saturation’ n and ‘mesh’ m (all these numbers can be determined
by means of simple combinatorial arguments†), and by using the formula (*)
(p. 242) we obtain three different estimates of the entropy H (and, consequently,
of the redundancy R = 1 - (H/H_0)). It is clear that all these estimates slightly
overstate the value of H (and understate the value of R), though the third of
them (corresponding to the division into the greater number of groups), in prin- 
ciple, ought to be more precise than its two predecessors. 
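
The two block characteristics and the sizes of the corresponding groups can be sketched as follows. The sketch assumes a block is given as a sequence of 0 (white) and 1 (black); the group sizes are the binomial expressions stated in the footnote to this passage, and the brute-force tally over all blocks of six elements is included only as a check. The function names are our own.

```python
from collections import Counter
from itertools import groupby, product
from math import comb


def saturation(block):
    """Total number of black elements in the block."""
    return sum(block)


def mesh(block):
    """Number of monochromatic sections into which the block is partitioned."""
    return sum(1 for _ in groupby(block))


def blocks_with_saturation(n_elements, n):
    """Number of blocks of n_elements elements with exactly n black ones."""
    return comb(n_elements, n)


def blocks_with_mesh(n_elements, m):
    """Number of blocks with exactly m sections: choose the m - 1 boundaries,
    then the colour of the first section."""
    return 2 * comb(n_elements - 1, m - 1)


# Brute-force tally over all 2^6 blocks of six elements.
all_blocks = list(product((0, 1), repeat=6))
saturation_tally = Counter(saturation(b) for b in all_blocks)
mesh_tally = Counter(mesh(b) for b in all_blocks)
```

Substituting these group sizes for M_i in formula (*), together with the measured group frequencies for q_i, gives the ‘saturation’ and ‘mesh’ estimates of the entropy of a block.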

Estimates of the values of H and R for the three types of text obtained as a 
result of these investigations are listed in the accompanying table. It is seen 


                      Evaluation by the data       Evaluation by the data
                         on ‘saturation’                 on ‘mesh’
                      H (in bits)      R           H (in bits)      R
Handwritten text         0.37         63%             0.22         78%
Typed text               0.53         47%             0.30         70%
Newspaper text           0.43         57%             0.34         66%
Average                  0.44         56%             0.29         71%


that the estimate of H obtained from the data on ‘saturation’ is found to be 
appreciably cruder than the estimate conforming to the data on ‘mesh’. Hence 
it can be inferred that the assumption of equi-probability of all different blocks 
with the same ‘mesh’ is in better agreement with the facts than that of all blocks 
with the same ‘saturation’. In other words, the blocks with the same ‘mesh’ 
form a more homogeneous group than do those with the same ‘saturation’. 
The evaluation of the entropy H from the data on the probabilities of all 
possible combinations of ‘saturation’ and ‘mesh’ demands a considerable increase 
in the volume of the subject material. In fact, it is easy to calculate that for 
blocks of 100 elements it is possible to form in all nearly 5000 (precisely 5001) 
such distinct combinations. Consequently, an entire set of all possible different 
blocks (containing 2^100 > 10^30 blocks, i.e., the number of blocks expressed by


†It is easy to show that in the general case of blocks of N elements


    M_n^{(satur.)} = \binom{N}{n} = \frac{N!}{n!\,(N - n)!}  and  M_m^{(mesh)} = 2 \binom{N - 1}{m - 1} = \frac{2\,(N - 1)!}{(m - 1)!\,(N - m)!}


(the latter formula follows from the fact that in this case the m - 1 ‘boundaries’ between
different monochromatic sections can be chosen in \binom{N - 1}{m - 1} different ways, and after this
the first monochromatic section can be chosen either as black or white by an arbitrary rule).
As to the number M_{n,m}, it is given by a more complex formula, which we shall not deduce
here.

a 31-digit number!) is split up here into 5001 individual groups. It is clear that
the probabilities of all these groups are by no means possible to evaluate from 
the data on the frequencies obtained during the analysis of 400 x 10 = 4000 
different blocks. Hence, the third estimate of the entropy is given in [99] only 
for ‘average Russian text’ (on the basis of the data on the frequencies of indivi- 
dual groups in the entire collection of analyzed blocks irrespective of the text 
from which they are extracted). This estimate, obtained by using formulae (*) 
and (**), has the form 


    0.23 > H > 0.06, i.e., 77% < R < 94%.


Here the true values of the entropy H and redundancy R apparently ought to 
lie somewhere between the stated limits. 
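
The counts quoted in this discussion (5001 combinations of ‘saturation’ and ‘mesh’ for blocks of 100 elements, and a 31-digit number of blocks) can be verified by a short calculation. The feasibility rule below is our own reasoning rather than a formula from [99]: a block with m sections has either m/2 black and m/2 white sections (m even) or numbers of black and white sections differing by one (m odd), and each section needs at least one element of its colour.

```python
from itertools import groupby, product


def feasible(n_elements, n_black, m):
    """True if some block has n_black black elements and m sections."""
    for black_sections in (m // 2, (m + 1) // 2):
        white_sections = m - black_sections
        n_white = n_elements - n_black
        black_ok = black_sections <= n_black and (black_sections > 0) == (n_black > 0)
        white_ok = white_sections <= n_white and (white_sections > 0) == (n_white > 0)
        if black_ok and white_ok:
            return True
    return False


def count_combinations(n_elements):
    """Number of distinct (saturation, mesh) pairs that actually occur."""
    return sum(
        feasible(n_elements, n, m)
        for n in range(n_elements + 1)
        for m in range(1, n_elements + 1)
    )


def brute_force_pairs(n_elements):
    """All (saturation, mesh) pairs found by listing every block (small N only)."""
    return {
        (sum(block), sum(1 for _ in groupby(block)))
        for block in product((0, 1), repeat=n_elements)
    }


combinations_for_100 = count_combinations(100)
digits_in_number_of_blocks = len(str(2 ** 100))
```

For blocks of 100 elements the count is 5001 and 2^100 has 31 digits, so the 4000 analyzed blocks are indeed too few to estimate the probabilities of all the groups reliably.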


During our discussions on phototelegrams, we have so far considered only 
cases of the transmission of text material (handwritten, typed or printed) through 
a phototelegram channel. However, phototelegrams can also be used for the 
transmission of a number of different types of white-and-black messages, and 
for a number of them the values of the average entropy (for one screen element) 
and the redundancy may turn out to be quite different from that in a literal text. 
Thus, for instance, it is clear that in the case of a drawing the redundancy would 
be expected to be appreciably higher than in the case of a text (in the first place 
due to the fact that in a drawing ‘black’ occupies a much smaller place than in 
a sheet of literal text). This conclusion has already been verified by earlier 
(though highly crude and expressly appreciably overstated) estimates of the 
entropy H for drawings obtained (on the basis of data on the length distribution 
of monochromatic sections) by Michel in his work [130] mentioned above. 
According to the estimates due to Michel, in the case of intricate radio-circuit 
diagrams which include a series of inscriptions it can be confidently stated that 
H < 0.12 bit, i.e., R > 88%, while for a simple drawing the entropy H can 
turn out to be less by even more than half (i.e., the redundancy exceeds 95%). 
A more accurate (but also considerably more complex) method for an approxi- 
mate evaluation of the entropy and redundancy of simple drawings (consisting 
of a number of continuous lines) was set forth by Foy [95]. In the case of a 
model example analyzed in [95], the calculation of just the deviation of the 
relative frequency p_1 of black elements from 1/2 led to the estimate H < 0.08 bit,
R > 92% (here the value of p_1 is close to 0.01), whereas the employment of the
more accurate method due to the authors permits one to obtain the following
result: H < 0.015 bit, R > 98.5%. As to the pictures and photographs to be 
transmitted through a phototelegram, these types of messages in fact differ little 
from black-and-white television pictures; hence we need not dwell exclusively 
on the data of their entropy and redundancy, and instead refer the reader to the 
preceding sub-section of this chapter. 
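
The simplest of the estimates quoted above, the one based only on the deviation of
the relative frequency of black elements from 1/2, can be checked numerically. The
following Python sketch assumes that a screen element is a two-valued quantity, so
that the entropy per element cannot exceed the binary entropy of the frequency of
black; the function names are illustrative and are not taken from [95].

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a two-valued element taking its values with
    probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def redundancy(h: float, h_max: float = 1.0) -> float:
    """Redundancy R = 1 - H / H_max; H_max is 1 bit for a two-valued element."""
    return 1.0 - h / h_max


# Relative frequency of black elements close to 0.01, as in the model example.
p_black = 0.01
h_bound = binary_entropy(p_black)
print(f"H is at most about {h_bound:.3f} bit, R is at least about {redundancy(h_bound):.1%}")
```

With a frequency of black close to 0.01 the bound comes out at roughly 0.08 bit per
element, in line with the first figure quoted from [95]; the sharper estimate
obtained there requires the more elaborate method and is not reproduced here.
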
4.3.6. Capacity of real communication channels


Let us now discuss briefly the question of the practical fruitfulness of the esti- 
mates of entropy and information of various messages in communication engi- 
neering. The role of entropy in the theory of message transmission is defined 
by the fundamental theorem of Section 4.2 (pp. 172-173). According to this
theorem, the maximum value of the transmission rate v attainable over a com- 
munication channel is defined by the formula 


$$ v = \frac{C}{H} \quad \text{elements/unit time}, $$


where H is the entropy of one element of a message (no matter whether it is a 
letter, phoneme, note, element of a teleimage, or screen element of a phototele- 
gram), and C is the channel capacity. Hence, in order to find the limiting trans- 
mission rate it is necessary to know not only the entropy H, whose determination
for different cases has been dealt with in the preceding sub-sections, but also the
capacity C. The question arises as to how to determine such capacity. 

In Section 4.2 it was seen that


$$ C = L \log m, $$


where L denotes the number of elementary signals that can be transmitted 
through a channel in unit time and m denotes the total number of distinct sig- 
nals to be used. In practice, the number m is often chosen with the condition 
that for the corresponding communication channel it is possible to set up a
sufficiently simple and inexpensive transmitting and receiving device. Thus, for
example, most often just two elementary signals are used (ordinarily, current on
and current off). This is due to the fact that the problem of distinguishing two such
signals at the receiving end is technically most straightforward and correspond- 
ing receiving devices are most economical and reliable. However, for those cases 
in which it is required to transmit as many messages as possible within unit time, 
it is natural to ignore considerations of simplicity and economy of channel cir- 
cuit and strive to increase to the maximum the values of L and m. And, at first 
glance the opportunities offered here seem to be completely unlimited: usually 
the signals transmitted over a communication channel may vary continuously, 
so that, as is apparent, it is possible to choose them as short in length and as 
slightly different from each other as desired. But this implies that L and m can 
be made as large as desired and, consequently, the capacity of any channel
transmitting continuous signals is in fact unbounded. The question then arises as
to what role is played in such a case by larger or smaller values of the entropy H.
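
The two formulas quoted above, v = C/H and C = L log m, can be combined in a short
numerical sketch. The figures used below (50 elementary signals per second and an
entropy of 2 bits per letter) are hypothetical and serve only as an illustration;
logarithms are taken to base 2, so that C is measured in bits per unit time.

```python
import math


def discrete_capacity(signals_per_unit_time: float, distinct_signals: int) -> float:
    """Capacity C = L * log2(m) in bits per unit time."""
    return signals_per_unit_time * math.log2(distinct_signals)


def limiting_rate(capacity: float, entropy_per_element: float) -> float:
    """Limiting transmission rate v = C / H in message elements per unit time."""
    return capacity / entropy_per_element


# Hypothetical channel: L = 50 elementary signals per second, m = 2 (on and off).
C = discrete_capacity(50, 2)
# Hypothetical message source: H = 2 bits per letter.
v = limiting_rate(C, 2.0)
print(f"C = {C} bit/sec, v = {v} letters/sec")

# Raising m from 2 to 32 multiplies the capacity by log2(32) = 5.
print(discrete_capacity(50, 32) / C)
```

The second print statement shows why increasing m looks so attractive at first
glance: the capacity grows with log m for a fixed L.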

In reality, however, the arguments set out here are not valid: any communication
channel transmitting continuous signals also has a strictly limited capacity.
In the first place, the value of a transmitted signal can never be changed
instantly; for this a definite time is always required. In practice, in a
communication channel being used the minimum time required for a noticeable
alteration of a signal is strictly determined by the engineering characteristics of
the channel itself. This leads to the fact that for every channel only the values of
a signal at time points separated by a definite minimum time interval t₀ can be chosen
more or less arbitrarily: once these values are chosen, the values of the signal
at all intervening instants of time are defined uniquely. In other words, the
maximal number L = 1/t₀ of distinct elementary signals that can be transmitted
through a communication channel in unit time is a fixed characteristic, which
cannot be altered without introducing changes into the channel itself. This
proposition, which plays a central role in all applications of information theory to
the problem of the transmission of continuous signals, was stated clearly even
before the origin of modern information theory (in 1933) in a report by
Kotel’nikov. The main result of Kotel’nikov’s work, which was also obtained
independently by Shannon in [21] and [158], permits one to express the number L in
terms of the usual engineering characteristic of a communication channel (the
so-called ‘transmission bandwidth’). The expression obtained shows, say in
the case of radio communication, that replacing a channel with the
object of increasing the value of L may bring no advantage, since it makes
impossible the operation of other radio channels transmitting over
nearby wavelengths (see, for example, [24], [25] or [115]).

But suppose that at least the number m could be chosen as large as desired; this
alone would suffice to attain as large a capacity C as desired. Unfortunately, even
this assumption is not true. To begin with, we cannot use signals of arbitrarily
high intensity, since producing them would require vast power. There
exists a strictly definite average power P for the signals to be transmitted,
defined uniquely by the energy source of our communication channel. In addition,
we also cannot distinguish signals whose values are too close to each other. We 
confronted this situation on pp. 228-230 where the maximal degree of closeness 
under which signals could be still distinguished, was determined purely by phy- 
siological factors (‘resolving power’ of the eye or ear). In the case of artificial 
communication channels, reception is effected by a special device, and at the 
price of modifying and further raising the cost of this device, its resolving 
power can be made practically as high as desired, i.e., a device that distinguishes 
between even extremely close signals can be produced. But there is one more 
factor that obstructs the discernment of close signals: noise. The fact is that
in every communication channel there exist disturbances which can by no means 
be eliminated; these disturbances distort the value of the transmitted signal. 
In the case of electro-communication, for instance, these disturbances can be 
produced by small oscillations of the load in the network, by the electrical field 
of adjacent circuits and neighbouring electrical machinery, or even just by the 
random ‘thermal’ motion of electrons in the conductors (this motion depends 
on the conductor temperature and is completely similar to the chaotic motion 
of gas molecules). In the case of radio-communication they can originate in
lightning discharges in the atmosphere or electrical discharges created by industrial
or transport facilities (say, by the sparking of the arc from a nearby passing
tram) and so on. If we denote by W the average power of these disturbances 
(i.e., the power of those distortions to which our signals are subjected in the 
process of transmission), then those signals, between which the variation in 
power is much less than W, are impossible to distinguish by any device at the 
receiving end—the small distinction between them is completely ‘masked’ by 
considerably larger ‘random’ distortion. Hence only signals that differ by not 
less than a certain definite value turn out to be discernible here. Since, in addi- 
tion, the maximal level of our signals (defining the average signal power P) also 
cannot be unboundedly large, there can be only a finite number m of levels of 
signal values distinct from each other. A quantitative analysis of the situation 
arising here has been carried out by Shannon [21] (see also [24] or [25]), showing 
that in general the number m can be defined by the equation $m = \sqrt{1 + P/W}$.
Thus, we arrive at the following expression for the capacity C of an arbitrary 
channel, transmitting continuously varying signals : 


$$ C = L_1 \log\left(1 + \frac{P}{W}\right), \qquad L_1 = \frac{L}{2} \qquad (*) $$


(where L₁ is some ‘universal’ characteristic of the communication channel,
irrespective of the signal to be transmitted). The conclusion that stems from this
remarkable formula is one of the most important contributions of information 
theory to general communication theory. 
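
The passage from the number of distinguishable levels to formula (*) can be checked
numerically. The sketch below assumes L₁ = L/2, which is what substituting
m = √(1 + P/W) into C = L log m gives; the numerical values of L and of the
signal-to-noise ratio are hypothetical, and logarithms are taken to base 2.

```python
import math


def distinguishable_levels(signal_power: float, noise_power: float) -> float:
    """Number of distinguishable signal levels, m = sqrt(1 + P/W)."""
    return math.sqrt(1.0 + signal_power / noise_power)


def capacity_from_levels(L: float, signal_power: float, noise_power: float) -> float:
    """Capacity computed as C = L * log2(m)."""
    return L * math.log2(distinguishable_levels(signal_power, noise_power))


def capacity_shannon(L: float, signal_power: float, noise_power: float) -> float:
    """Capacity computed as C = L1 * log2(1 + P/W), with L1 = L / 2 (assumed)."""
    return (L / 2.0) * math.log2(1.0 + signal_power / noise_power)


# Hypothetical channel: L = 6000 elementary signals per second, P/W = 1000.
L, P, W = 6000.0, 1000.0, 1.0
print(capacity_from_levels(L, P, W))
print(capacity_shannon(L, P, W))
```

Both functions return the same value up to rounding, since
log √(1 + P/W) = (1/2) log(1 + P/W). Note that the capacity grows only
logarithmically with the signal power P, whereas it is proportional to L.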

The deduced formula enables us to calculate easily the capacity of every concrete
communication channel. For this, apart from the engineering characteristics
of the channel itself, it is necessary to know only the signal-to-noise ratio,
i.e., P/W. For television channels, C usually turns out to be of the order of tens
of millions of bits per second; for telephonic, phototelegraphic and
radiotransmission channels, C varies from several thousands to several tens of
thousands of bits per second, and for telegraph channels C is of the order of tens
or hundreds of bits per second (see, for example, [115], [132] or [166]).

It is essential here that the existing channel capacities in all cases (except,
perhaps, telegraphy) theoretically permit information transmission at a considerably
higher rate than that achieved in ordinary practical transmissions. Thus,
in telegraphy, say, information is usually transmitted at a rate not exceeding
75 bit/sec; in telephony, at a rate not exceeding 2,500 bit/sec; in television, at
a rate not exceeding 500,000 bit/sec. Hence all methods actually being employed


†Here we speak only of the capacity of a channel transmitting continuous signals, since
the case of the transmission of discrete signals in the presence of noise is especially examined 
in the next section. 
at present for message transmission utilize as a rule just a small part of the
available communication channel capacity. A fuller utilization of the capacity
requires the application of considerably more effective methods of encoding and
decoding; this gives rise to many difficult problems, theoretical as well as purely
applied, that are presently engaging the attention of a large number of workers all
over the world (we shall speak of this in more detail in Sec. 4.5). Note that
recent achievements in the field of the theory and practice of encoding and 
decoding now enable us in principle to enhance substantially the effectiveness of 
the use of communication channels: thus, in experimental transmissions espe- 
cially organized by American scientists and engineers, an information transmis- 
sion rate was successfully achieved that was of the order of 7,500-8,000 bit/sec 
over telephony (see, for example [25], p. 762; [94] or [199], p. 7) and 20,000,000 
bit/sec over television (see [94]). However, even such information transmission
rates seem to be inadequate for future needs: the total amount of
information to be transmitted through existing communication channels tends
to increase every year in a majority of countries, and in the near future we may
expect the evolution of new modes of transmission (say, the video telephone), and
also the emergence of two-way television communications between individual
institutions in various cities and a massive use of direct digital data transmission
to large centralized computer centres, which promise a significant acceleration of
this process. Hence at present a number of laboratories around the world have
begun to develop entirely new forms of communication channels having
appreciably larger capacities, the foremost of them being the metallic and
dielectric waveguide channels† with capacities of the order of 5·10⁸ to 1·10⁹ bit/sec
and optical waveguides of glass fibres with a capacity of the order of 10⁹ bit/sec
per fibre. Such projects were discussed, in particular, at the International Con- 
ference on Communication Engineering held in Montreal in June 1971, at the 
International Information Theory Symposium at Tsakhkadzor, Armenian SSR, 
in September 1971 and at many other conferences related to communication
engineering. Of course, the actual introduction of such new communication channels
still requires overcoming a large number of technical difficulties, but
the very fact of the emergence of such studies is quite significant.

It is interesting to note that the mechanical concept of capacity can also be 
carried over completely to those ‘communication channels’ through which every 
living organism receives information from its sense organs. In fact, we have 
already described in Chapter 2 special psychological experiments, which show 
that the time required for assimilation of any information by the central nervous 
system is directly proportional to the amount of this information; thus the same 


†The waveguides (radio and optical) are in fact pipes through which the waves
are propagated. The presence of an outer shell enables us to decrease strongly the noise level
and at the same time to use a very wide frequency band without creating interference with
other communication channels.
laws are satisfied here as hold for all communication channels. Recently, some
literature has also been published justifying the applicability of Shannon’s for- 
mula (*) (p. 248) to the nervous communication channels in the human organ- 
ism; however, at this point it is impossible to consider that the last word has 
been said on this question. 

The capacity C of individual sense organs can be estimated quite roughly on 
the basis of the physiological data of their resolving power (i.e., the total num- 
ber of objects to be distinguished by means of some sense organs) and the aver- 
age time required for perception (i.e., the maximum frequency of the change of 
external influences during which these influences can all still be interpreted sepa- 
rately). This permits one to show, in particular, that the capacity of individual 
sense organs differs sharply; the human eye under favourable lighting conditions 
is obviously capable of receiving (and transmitting to the central nervous system) 
information at a rate of the order of millions (or tens of millions) of bits per 
second, whereas the ear receives information at a considerably slower rate—of the 
order of thousands of bits per second (see, for example, [109], [110], [114] and 
[156]). Such variances in the capacities can partially be explained by the sharp 
difference between the number of nerve fibres that serve, respectively, hearing and 
vision (according to modern physiological data, the number of ‘aural nerve fibres’ 
is of the order of 30,000, in contrast to roughly 800,000—900,000 ‘optic nerve 
fibres’). The sense of touch, in terms of its capacity to receive and transmit
information, obviously lies somewhere between vision and hearing. However, it is
worth noting that only a quite small part of the information transmitted by the
sense organs can be assimilated consciously by the human brain; this clearly follows,
for instance, from the data enunciated on p. 218 on the information reception 
rate in a conversation (it was observed there that when a conversation is rapid 
a part of the ‘unsemantic’ information is lost, since the listener has no time to 
reinterpret it). A detailed analysis of results concerning the maximum speed 
attainable in speech, reading, writing (shorthand) and so on, shows that in all 
cases a person is able to comprehend the information received only if the rate 
of its receipt does not exceed roughly 50 bit/sec (see, for example, [119] and 
[136]).† A quantity of the same order is also obtained in determining the amount
of visual information to be assimilated by a spectator by a quick glance at chang- 
ing figures projected on a screen [164]. Finally, especially designed experiments 
for the determination of the minimum physiological reaction time (see p. 56 et 
seq.), attainable under the most favourable conditions of reception, also show 
that the capacity of the human central nervous system is approximately 
equal to 30-40 bit/sec (see, [136] and [1£0]). Obviously, there still remains 


†Recall also that, in agreement with what is stated on p. 218, for a normal conversation just
about half of the information received by the listener is contained in the written version
of the speech; the rest of the information concerns the voice of the speaker, his emotions,
‘insistence’ stresses and so on.
much to be explored† with respect to further sharpening of these figures and
clarifying their dependence on the individual peculiarities of a person and his
physical and mental condition. However, the fruitfulness of applying
the general ideas of information theory to the study of the nervous activity of
human beings and animals is no longer in doubt and is an established fact in its
own right.


4.3.7. A general scheme of information transmission through communication 
channels. Genetic information transmission 


In the present concluding subsection, we shall say a few more words on the 
general scheme of message transmission through communication channels, which 
in fact formed our starting point in Section 4.1. The process of message
transmission through an arbitrary communication channel can be schematically 
presented as follows : 


Input               Input      +---------------+     Output               Output
message --Coding--> signal --> | Communication | --> signal --Decoding--> message
                               |    channel    |
                               +---------------+
                                       ^
                                       |
                                     Noise


In the case of the transmission of some text through a telegraph channel, say,
the input and output messages α₁ and β₁ are written in a definite (one and
the same) language by means of the letters of the appropriate alphabet and can differ
from one another only because of distortion in the transmission process, while the
input and output signals α and β represent sequences of electric ‘elementary
signals’ (usually current on and current off). Thus, the coding and decoding
operations consist here of the conversion of the letter message α₁ into a sequence of
‘elementary signals’ α and of the reverse passage from the received sequence of
‘elementary signals’ β to the letter message β₁. In telephonic message transmission
along a wire, α₁ has the character of sound, i.e., it is a sequence of air pressure
fluctuations; the coding consists here of the transformation of these pressure
fluctuations into electric current fluctuations, and the decoding of the reverse
transformation of the received current fluctuations into sound. In the communication
channels of modern electronic computers, the input message α₁ is a definite
sequence of numbers, the coding consists of its conversion into a definite sequence α
of electric signals, directly fed into the computer, and the decoding consists of the
transformation of the signals β received in the computer (representing the sum of
the ‘input signals’ α and the ‘distortion in the input process’) into an entirely new


†See, in particular, a survey of this question in [50] and the references to the original
literature listed there, which contain a multitude of mutually contradictory data.
message β₁, representing the solution of the problem we seek to solve with the
aid of the computer. Here β₁ differs in principle from α₁, and the conversion of α₁
into β₁ is the main goal of our communication channel. Similarly, in the case of
the transmission of a visual ‘image’ through optic nerve fibres the ‘messages’
α₁ and β₁ differ sharply from each other: here α₁ consists of a collection of light
waves of distinct wavelengths (i.e., distinct colours) and different amplitudes
(i.e., intensities), and β₁ is a collection of stimulations of definite nerve
cells (neurons) of the brain (the so-called ‘visual neurons’), which we perceive
as a certain visual picture. The signal α in this particular case is a
collection of electric impulses produced by the receptors of light (the cones and
rods of the retina) in the eye, and the coding consists of the conversion of light
into such impulses, a process which so far has not been well studied. The decoding
consists here of the transition from the electric impulses β, reaching the brain
through nerve fibres, to the stimulations of neurons β₁, but its details are still
considerably less known than those of the coding.

The general description of an arbitrary noisy communication channel and the 
determination of the theoretical limitations of the opportunities of using such 
a channel in information transmission, are examined in Section 4.4; and the 
concluding section (Section 4.5) forms an introduction to the extensive theory 
of optimal coding and decoding of discrete messages transmitted over noisy 
communication channels. Here we remark only that in many
cases even the study of the ‘alphabet’ itself in which the messages α₁ and β₁ are
written, and of the nature of the ‘elementary signals’ α to be transmitted, is of
great interest and not at all straightforward. The most striking example in this
connection is the problem of transmitting genetic information, whose successful
study is linked to a number of the most outstanding scientific achievements of
the last three decades.


In view of the general scientific importance of this topic and its intimate relation to the 
general formulation of the information transmission problem, it is appropriate to dwell here 
upon the related results in some detail. The ‘communication channels’ associated with the 
heredity phenomenon play a primary role in the very existence of organic life. Through these 
channels vast and extremely important information is constantly transmitted with striking pre- 
cision. On Earth in all nearly 2 million individual species of animals and plants are recorded— 
and over the ‘communication channels’ under consideration signals are transmitted to indicate 
precisely what particular species ought to grow from a single embryonic cell. The information 
transmitted here is not at all restricted to just a single indication of species—it contains also 
sufficiently comprehensive data concerning the peculiarities of the structure of the species and, 
in addition, data concerning the hereditary singularities of an individual organism developed 
from a given cell. All this information is preserved somewhere in the extremely small volume
of the nucleus of the embryonic cell and is transmitted through some sufficiently complex
pathway to the substance (‘cytoplasm’) of both the primary cell and all other cells that are
produced in division processes originating at a given cell; it is preserved even in the process of
subsequent reproduction of succeeding generations of similar species. 

The construction of appropriate communication channels and the methods of information 
transmission over them seemed to be quite mysterious until recently. It was hardly possible to 
foresee the rapid developments in this field that were linked to the spectacular advancements
of molecular biology in the period following the last world war. A central role in this regard
was played by the discovery of the fundamental importance of the enormous chain-like
polymer molecules of the so-called deoxyribonucleic acid (abbreviated DNA), arranged in the
chromosomes of the cell nucleus. It is known that these molecules consist of long alternating
chains of carbohydrate and phosphate groups of identical composition. To each carbohydrate
group there is also attached a side group from a collection of four standard bases called
adenine, guanine, cytosine and thymine. All distinctions admissible in the DNA molecules are
restricted to those concerning the order of succession of the corresponding bases (which, for
brevity, may be denoted by their first letters A, G, C and T, or may also be just numbered by
the digits 0, 1, 2 and 3). Thus, the original ‘message’ α₁ is preserved here in the chromosomes
of the cell nucleus and written in the ‘four-letter alphabet’ of DNA molecules. One DNA
molecule in a chromosome can contain several tens of thousands or even more carbohydrate
groups (and, consequently, also bases), and the number of individual chromosomes in a cell
nucleus can amount to several tens; thus, the amount of information that can be stored in a
chromosome is of the order of


$$ \log 4^{100\,000} = 200{,}000 \text{ bits} $$


(or even more). Thus, the amount of information that can be stored is more than ample
for all the hereditary data that have to be transmitted.
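
The order-of-magnitude estimate above amounts to counting 2 bits per base. A minimal
Python sketch, assuming a strand of 100,000 bases with all four bases treated as
equally likely and independent (which gives the maximum possible amount of
information); the function name is illustrative:

```python
import math


def storage_capacity_bits(n_bases: int, alphabet_size: int = 4) -> float:
    """Maximum information, in bits, of a sequence of n_bases 'letters' drawn
    from an alphabet of alphabet_size symbols: log2(alphabet_size ** n_bases)."""
    return n_bases * math.log2(alphabet_size)


print(storage_capacity_bits(100_000))  # 2 bits per base
```

The product n · log₂ 4 is used rather than the logarithm of 4 raised to the power n,
which avoids forming an enormous integer.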

In fact, the chromosome structure is somewhat more complex: usually a chromosome
includes not a single, but a double strand of DNA, composed of two such molecules, which are
wound in the form of two helices coiled in opposite directions around one (not actually
existing) cylinder. These two DNA molecules are not identical but ‘complementary’:
adenine in one of them always corresponds to thymine in the other and guanine to cytosine;
the corresponding paired bases, arranged on the cylinder surface opposite to each other, are
linked by a comparatively weak hydrogen bond. Such a ‘double helix’ chromosome structure
plays a key role in the process of chromosome replication during cell division (‘mitosis’),
when each of the two new (daughter) cells reproduces for itself a set of chromosomes identical
to that possessed by the original (parent) cell; this process is apparently related to the
‘uncoiling’ of the two DNA strands entering a chromosome, during which the two long DNA
molecules get separated from each other and then each attaches to itself one more
‘complementary’ chain, forming an independent double helix. The resulting transmission of
information from the parent cell to the daughter cells plays a fundamental role in all
biological phenomena; here the set of chromosomes (DNA chains) of the parent cell plays the
role of the input ‘message’ α₁ to be transmitted and that of the two new daughter cells serves
as the ‘output message’ β₁. The ‘output message’ β₁ is obtained here directly from the ‘input
message’ α₁, and this makes superfluous the problem of coding and decoding of ‘messages’.
At the same time the question of ‘noise’ in our communication channel is unusually
important, because the distortions arising as a result of such ‘noise’ (caused, say, by
radioactive irradiation of a cell) represent variations in hereditary characteristics
(‘mutations’), which occupy a central position in the process of the evolution of organic
species.
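
The replication scheme just described can be illustrated by a small Python sketch.
It is a deliberately simplified model: strands are plain strings over the letters
A, G, C and T, bases are paired position by position, and the spatial orientation
of the strands is ignored. The function names and the sample strand are
hypothetical.

```python
# Complementary base pairs: adenine with thymine, guanine with cytosine.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand: str) -> str:
    """Return the strand whose bases pair, position by position, with `strand`."""
    return "".join(COMPLEMENT[base] for base in strand)


def replicate(double_strand: tuple[str, str]) -> tuple[tuple[str, str], tuple[str, str]]:
    """Separate the two strands and let each one attach a new complementary
    strand, giving two double strands."""
    first, second = double_strand
    daughter_1 = (first, complementary_strand(first))
    daughter_2 = (complementary_strand(second), second)
    return daughter_1, daughter_2


parent = ("AGCTTAC", complementary_strand("AGCTTAC"))
daughter_1, daughter_2 = replicate(parent)
print(parent)
print(daughter_1 == parent, daughter_2 == parent)
```

In the absence of ‘noise’ each daughter double strand coincides with the parent
one; a mutation would correspond to a single letter being altered in one of the
copies.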

We now pass on to the information transmission from a chromosome to the ‘body’
(= ‘cytoplasm’) of the cell, which determines the development of any living creature, from
one-celled organisms to man, from a single embryonic cell. An important role in all vital
functions of an organism is assigned to the proteins, in particular to enzymes, which control all
biochemical reactions that take place in living organisms. Protein synthesis takes place in the
so-called ribosomes, small formations within the cytoplasm of a cell. The structure of
protein molecules is also quite simple: all proteins are constructed from roughly 20 different
amino acids, following one another in a definite order along the linear protein molecules.
These amino acids are listed in the accompanying table together with the abbreviations of
their names that are accepted in biochemistry. Each protein has its own characteristic
sequence of amino acids, whose length ranges from 100 to 300 or more.


Amino acid       Abbreviation     Amino acid       Abbreviation
Alanine          ala              Leucine          leu
Arginine         arg              Lysine           lys
Asparagine       asn              Methionine       met
Aspartic acid    asp              Phenylalanine    phe
Cysteine         cys              Proline          pro
Glutamic acid    glu              Serine           ser
Glutamine        gln              Threonine        thr
Glycine          gly              Tryptophan       try
Histidine        his              Tyrosine         tyr
Isoleucine       ilu              Valine           val


Thus, it can be said that ribosomes serve as the receiving end (‘output’) of our
communication channel; the ‘output message’ β₁ in this case is represented by proteins and
it is written in a ‘twenty-letter alphabet’ of amino acids. It remains to clarify just how the
transfer of information from DNA to proteins takes place, and in particular, what ought to
be understood by the ‘input signal’ α and ‘output signal’ β.

A completely satisfactory answer can be given at present to the latter question. A key role
in the process of information transmission from chromosome DNA to protein molecules is
played by still another nucleic acid, the so-called ribonucleic acid (abbreviated RNA). The
structure of RNA is closely similar to that of DNA, only the carbohydrate group is somewhat
different here and thymine is replaced by a different base, uracil, which varies slightly from
thymine in chemical composition. Thus, an RNA molecule can be considered as a ‘signal’
encoded with the aid of four ‘elementary signals’ A, G, C and U (or 0, 1, 2 and 3′) that are
quite similar to the ‘letters’ A, G, C and T of the original ‘message’. Along the DNA
molecules of chromosomes, as along a certain ‘template’, definite linear molecules of RNA
(the so-called ‘messenger’ RNA or mRNA) are synthesized, which subsequently leave the cell
nucleus and penetrate into the ribosomes; these mRNA molecules play an important role in
the process of protein synthesis. Thus, the general scheme depicted on p. 251 for information
transmission through communication channels has the following form in the case considered:


[Scheme: DNA → mRNA → mRNA → Proteins, the last stage taking place in the ribosomes]


Here the role of the ‘input message’ α₁ and ‘output message’ β₁ is assigned to the DNA and
proteins, respectively, and that of the ‘input signal’ α and ‘output signal’ β to the molecules of
mRNA.

In accordance with the above scheme the ‘transmitted message’ α₁ is written in a ‘four-letter
alphabet’ and the ‘received message’ β₁ in a ‘twenty-letter alphabet’, so that for our
communication channel the number m of elementary signals entering its ‘input’ and the number r
of elementary signals received at its ‘output’ are distinct (m = 4 and r = 20). Moreover, the
‘codes’, in which the ‘signals’ α and β are written, have four ‘elementary signals’. As to the
coding and decoding operations, i.e., the conversion of the ‘message’ α₁ into the ‘signal’ α and
of the ‘signal’ β into the ‘message’ β₁, convincing studies have been undertaken comparatively
recently. Of the two operations enumerated above, ‘coding’ is naturally much more elementary
(and hence also of less interest). In fact, coding involves the simple transformation of a
sequence of four alternating ‘letters’ A, G, C and T into a sequence of four ‘elementary
signals’ A, G, C and U. Here it is possible to indicate many simple and easily realizable coding
systems: thus, for instance, the peculiar ‘complementarity’ of specific base pairs manifested,
in particular, in the structure of ‘double’ DNA molecules suggests a scheme in which guanine
‘produces’ cytosine, cytosine ‘produces’ guanine, thymine ‘produces’ adenine and adenine
‘produces’ uracil. Apparently, such a coding is widely used in nature, though possibly it is
not fully universal.†
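
The ‘complementary’ coding scheme described above (guanine produces cytosine,
cytosine produces guanine, thymine produces adenine and adenine produces uracil) is
a letter-by-letter substitution and can be written down directly. The following
Python sketch is only an illustration of that substitution; the function name and
the sample DNA string are hypothetical.

```python
# Letter-by-letter 'coding' of a DNA message into an mRNA signal.
DNA_TO_MRNA = {"G": "C", "C": "G", "T": "A", "A": "U"}


def encode(dna: str) -> str:
    """Convert a DNA 'message' over A, G, C, T into an mRNA 'signal' over A, G, C, U."""
    return "".join(DNA_TO_MRNA[base] for base in dna)


print(encode("TACGGT"))
```

Since the mapping is one-to-one, no information is lost at this stage: both the
message and the signal use four-letter alphabets.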

Decoding is of considerably more interest in our case, since it consists of the intricate
passage from the ‘four-letter language’ of mRNA to the ‘twenty-letter language’ of proteins.
It is precisely this operation that we have in view when we speak of a ‘genetic code’. It is
clear that one mRNA base, which can take only four ‘values’ A, G, C and U, can by no means
contain complete information on one of the twenty possible amino acids. Hence, it is necessary
to assume that one amino acid is determined by a sequence of several adjoining bases in an RNA
molecule: such a base sequence, ‘coding’ one letter of the amino acid alphabet, is usually
called a codon. Since the number of distinct sequences of two RNA bases equals 4 × 4 = 16,
i.e., it is less than the number of different amino acids, a codon must contain not less than
three bases; three bases in a codon, however, suffice, since the number of all possible base
triplets equals 4 × 4 × 4 = 64, which is far more than twenty.

The first hypothesis on the nature of a genetic code was suggested in 1954 by the well-known
American physicist and astrophysicist Gamow [100]. Gamow postulated that a given (let us
say first) amino acid in a protein chain is determined by a certain triplet of successive RNA
bases, say the first, second and third bases, and the following (second) amino acid by a triplet
that is shifted by one, i.e., by the second, third and fourth bases; similarly the third amino
acid is determined by a triplet that is shifted by two bases, and so on. Such a code with
partially overlapping codons is called an ‘overlapping code’ (see the scheme on the next page,
where bases are denoted by small ovals and amino acids by asterisks). It was also assumed here
that an amino acid of a protein depends only on the composition of the corresponding codon,
but not on the order of the individual bases in the codon. The main argument that prompted
Gamow to use this hypothesis was that the number of triplets distinct in composition that can
be formed from four bases is given by

\[
\underbrace{\binom{4}{3}}_{\substack{\text{the number of triplets}\\ \text{of mutually distinct bases}}}
+ \underbrace{2\binom{4}{2}}_{\substack{\text{the number of triplets}\\ \text{containing two identical bases}}}
+ \underbrace{\binom{4}{1}}_{\substack{\text{the number of triplets}\\ \text{of three identical bases}}}
= 4 + 12 + 4 = 20.
\]
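The count of compositions can be checked by brute-force enumeration. The following is only a minimal sketch; the names `BASES`, `compositions` and `by_type` are illustrative and do not come from the text.

```python
from itertools import combinations_with_replacement
from collections import Counter

BASES = "AGCU"  # the four RNA bases

# A "composition" ignores the order of bases, so it is a multiset of size 3.
compositions = list(combinations_with_replacement(BASES, 3))

# Classify each composition by how many distinct bases it contains:
# 3 -> all bases distinct, 2 -> two identical bases, 1 -> three identical bases.
by_type = Counter(len(set(c)) for c in compositions)

print(len(compositions), by_type[3], by_type[2], by_type[1])
```

The three counts correspond, in order, to the three labelled terms of the sum above.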


†Thus, for instance, there are many viruses in which long RNA molecules replace DNA
molecules as the primary genetic material. Hence the ‘input message’ $\alpha_0$ is written here
from the very start in the ‘alphabet’ A, G, C, U.

The ‘overlapping code’ of Gamow was found not to agree with reality, and the same was true
of the ‘nonoverlapping composition code’ (see the accompanying scheme) suggested by Gamow
and Ycas [102] (in this code an amino acid of a protein is also determined by the composition
of the corresponding codon, but codons do not overlap with each other). However, the clarity
of Gamow’s formulation of the main problem of protein synthesis in a living cell played a
significant role in the further development of this branch of molecular biology. The problem
can be described as that of finding a ‘translation’ of a signal $\beta$ written in the four-letter
RNA language into a message $\beta_0$ written in a twenty-letter protein language, which is
consistent with the experimental data.

At one time the idea of a ‘code without commas’ introduced by Crick and his group [88]
competed with the ‘composition code’ of Gamow and Ycas. Such codes were for a reasonably
long time widely discussed by a number of scientists (see, for example, the paper [104] written
jointly by the mathematicians Golomb and Welch and the geneticist Delbrück). Here the
term ‘code without commas’ is understood slightly differently from that on p. 140, where it
was used to mean an arbitrary uniquely decipherable code (each uniform code consisting of
only three-letter codons evidently satisfies the last condition). But if we assume that a code is
nonoverlapping, then it is not clear how the end of one codon and the start of the next are
discerned. In fact, the same sequence of bases, say ...AGGCTCA..., can be divided variously
into three-letter ‘codons’; it can be ‘read’ either as ...(AGG)(CTC)(A..., or as
...AG)(GCT)(CA..., or as ...A)(GGC)(TCA).... There are at least three different
possible ways to avoid the uncertainties that arise. In principle, there can be some particular
‘initiation mark’ indicating the starting point of a codon sequence.† It is also possible that a
special base sequence exists (it perhaps contains a larger or smaller number of bases than the
codons do) that separates the individual codons from each other; such a base sequence is then
deciphered as a ‘comma’ separating the ‘words’ (codons) from each other. Finally,
communication theory specialists are also aware of ‘codes without commas’ such that an
arbitrary sequence of ‘letters’ (in our case, DNA bases) admits just one possibility for a
meaningful reading, while any other way of dividing this letter sequence into individual
‘words’ leads to a sequence of meaningless letter combinations.


†Let us note here that apparently this particular variant is realized in nature. There are
special ‘initiation’ and ‘termination’ marks indicating the initiation and termination of a ‘gene’,
i.e., of a base sequence that codes a specific protein produced in a given cell. In many cases,
different genes are also separated by a definite series of bases which contain no information
about any amino acid of the cell but play a distinct biological role.


It is clear that a ‘code without commas’ thus defined must be ‘incomplete’: there must exist
in it letter sequences which designate no ‘words’ (constitute no codons). Accepting that every
codon consists of three bases (a triplet code), it is easy to determine the greatest possible
number of intelligible codons. It is clear that a ‘triplet’ consisting of three identical ‘letters’
(bases), say AAA, cannot make sense, because otherwise a long sequence of the corresponding
‘letters’ ...AAAAAAAAA... could be read sensibly starting from any place. The remaining
64 - 4 = 60 distinct triplets can be divided into 20 groups of 3 triplets each, obtained from
each other by the ‘cyclic rearrangement of letters’ (bases); examples of such groups are AGC,
GCA and CAG, or CCT, CTC and TCC. It is evident that out of the three triplets of a group
just one can make sense, because otherwise it would also be impossible to determine uniquely
from what place it is necessary to start the reading of codons in a long sequence of identical
triplets of one of these forms. Thus the largest possible number of sensible codons in the case
of a triplet code without commas cannot exceed 60/3 = 20. It can also be shown that in fact it
is exactly equal to 20. This fact provided Crick and the researchers sharing his viewpoint with
a strong argument in favour of the hypothesis that the genetic code is a ‘code without commas.’
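The counting argument above, and the claim that the bound of 20 is attained, can both be checked by enumeration. The sketch below is only illustrative. The particular code it tests (all triplets xyz with x < y >= z in alphabetical order) is a well-known construction assumed here for the example; it is not taken from the text, and all function and variable names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
triplets = ["".join(t) for t in product(BASES, repeat=3)]

def cyclic_class(t):
    """Return the set of cyclic rearrangements of a triplet."""
    return frozenset({t, t[1:] + t[0], t[2] + t[:2]})

# Triplets of three identical letters are excluded; the remaining 60
# fall into classes of cyclic rearrangements.
classes = {cyclic_class(t) for t in triplets if len(set(t)) > 1}

# Assumed construction (not from the text): keep xyz with x < y >= z,
# comparing letters alphabetically.
code = {t for t in triplets if t[0] < t[1] >= t[2]}

def is_comma_free(codons):
    """True if no shifted reading of two adjacent codons is itself a codon."""
    for u in codons:
        for v in codons:
            w = u + v
            if w[1:4] in codons or w[2:5] in codons:
                return False
    return True

print(len(classes), len(code), is_comma_free(code))
```

The first number is the upper bound 60/3 obtained in the text; the second and third show that the assumed construction reaches this bound while remaining readable in only one way.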

The solution of the problem of the structure of the ‘genetic code’ was, however, obtained not
at a writing table but directly in the laboratories. In the early sixties, a group of biochemists
led by Marshall W. Nirenberg succeeded in showing that a synthesis of protein-like amino
acid chains can be accomplished experimentally even in a cell-free system (i.e., in the absence
of living cells). The system was made by breaking open cells of some bacillus, extracting from
them ribosomes and all the basic components of a cytoplasm medium and adding the obtained
material to a synthetic RNA of a definite composition, which in the process of protein synthesis
plays the role of a messenger RNA of the living cell. In the first such experiment, carried out
by Marshall Nirenberg and Heinrich Matthaei, the synthetic RNA contained only one repeated
base, uracil; the synthesis of an artificial protein consisting of the repeated amino acid
phenylalanine (phe) was observed. Thus the RNA ...UUUUUUUUU... generated the
protein ...phe phe phe..., which implies that if the code is a triplet code, then the amino acid
phe must correspond to the codon UUU. Similarly, it was shown that the amino acid proline
(pro) corresponds to the codon CCC.

During the sixties a vigorous ‘attack’ was launched on the problem of the genetic code by
numerous biochemical laboratories of the world. Among the participants in this campaign,
besides Nirenberg and his associates (among whom Philip Leder played a very important role),
we must mention the Indian-born scientist H. Gobind Khorana and the Spanish-born scientist
Severo Ochoa, both working in the USA. We shall not dwell upon this at length here, and refer
the interested reader to the relatively old survey of Gamow, Rich and Ycas [101], describing the
initial stages of the endeavour to decipher the genetic code, to the (also rather early) popular
articles of Crick and Nirenberg [87], intended for non-specialists, and particularly to the
self-contained monograph of Ycas [175], listing more than 800 references. The researches of
many scientists have established that the genetic code is indeed a triplet and nonoverlapping
code;† that it is ‘degenerate’ in the sense that several different codons correspond to a
particular amino acid; and that there are ‘nonsense’ (i.e., carrying no genetic information)
triplets, which in general are not codons in the sense that they do not correspond to any amino
acids.† The accompanying table from [175] shows the genetic code as interpreted by
contemporary scientists (a dash in the ‘Amino acid’ column implies that the related triplet is
not a codon).


†The genetic code is nonoverlapping in the sense that two successive codons of a gene do
not overlap but occupy two adjacent base triplets. However, it was discovered recently (see
[72]) that in the middle of a gene an ‘initiation mark’ can also appear which is shifted by one
or two bases relative to the codon beginnings of the primary gene. This initiation mark indicates
the beginning of a new gene which overlaps with the first one; all the codons of the new gene
are shifted in relation to the codons of the first gene, i.e., each new codon overlaps two adjacent
codons of the first gene. This discovery is quite interesting, but it does not affect any of the
foregoing results related to the biological code.


Codon  Amino acid    Codon  Amino acid    Codon  Amino acid    Codon  Amino acid
UUU    phe           UCU    ser           UGU    cys           UAU    tyr
UUC    phe           UCC    ser           UGC    cys           UAC    tyr
UUA    leu           UCA    ser           UGA    -             UAA    -
UUG    leu           UCG    ser           UGG    try           UAG    -
CUU    leu           CCU    pro           CGU    arg           CAU    his
CUC    leu           CCC    pro           CGC    arg           CAC    his
CUA    leu           CCA    pro           CGA    arg           CAA    gln
CUG    leu           CCG    pro           CGG    arg           CAG    gln
AUU    ilu           ACU    thr           AGU    ser           AAU    asn
AUC    ilu           ACC    thr           AGC    ser           AAC    asn
AUA    ilu           ACA    thr           AGA    arg           AAA    lys
AUG    met           ACG    thr           AGG    arg           AAG    lys
GUU    val           GCU    ala           GGU    gly           GAU    asp
GUC    val           GCC    ala           GGC    gly           GAC    asp
GUA    val           GCA    ala           GGA    gly           GAA    glu
GUG    val           GCG    ala           GGG    gly           GAG    glu


†But, nevertheless, they are genetically important (in particular, they can indicate the
beginning and the end of a gene, i.e., can play the role of initiation or termination marks).
In this context, see [175, Chap. VIII].


4.4. Transmission of information over noisy channels


In Sections 4.1 and 4.2 of this chapter we sketched briefly, via an example from
telegraphy, the basic concepts and results of the general theory of transmission
of information over a communication channel. It was, however, always implied
there that signals were transmitted over a communication channel without any
distortion. But in practical communications this never occurs; there is always
a possibility of some noise, which causes distortion of the signals in the
transmission process. A passing reference to this was already made in Section 4.3 in
connection with the analysis of the performance of a communication channel
transmitting continuous signals (see pp. 247-248). In this section we revert to
the simple scheme of discrete communication channels considered in Sections
4.1 and 4.2, i.e., we assume that only a finite number of distinct ‘elementary
signals’ of constant duration are transmitted over the channel. (The simplest case
is of course that in which there are only two distinct signals, on-current and
off-current.) However, contrary to Sections 4.1 and 4.2, we shall no longer ignore
the influence of noise. This means that we shall take note of the possibility that
a given elementary signal, in consequence of the distortion induced by noise,
may be erroneously interpreted at the receiving end as some different elementary
signal (for example, on-current may be misinterpreted as off-current). Let us now
consider the application of information theory to this more complex (but, on the
other hand, also more realistic) case of information transmission.

Following Section 4.2, we assume for simplicity that the successive ‘letters’ of
the message are mutually independent, and the $n$ letters of the alphabet are
characterized by definite probabilities $p_1, p_2, \ldots, p_n$ of the appearance of each
letter at any place. We consider a communication channel in which $m$ different
elementary signals $A_1, A_2, \ldots, A_m$ are used for transmission, and $L$ such signals
are transmitted per unit of time (i.e., the duration of one signal is $\tau = 1/L$).
Then, according to the main results of Section 4.2, it is possible to transmit
information across a noiseless communication channel at a rate arbitrarily close to
the quantity

\[ v = \frac{C}{H} \ \text{letters per unit time} \]

(where

\[ C = L \log m \]

is the capacity of the communication channel, and

\[ H = -p_1 \log p_1 - p_2 \log p_2 - \ldots - p_n \log p_n \]

is the entropy of a single letter of the message to be transmitted); however, a
transmission rate exceeding $v$ can never be attained here. In order to attain a
transmission rate extremely close to $v$, the only requirement is to partition the
message into sufficiently long blocks and encode individual blocks by means of,
for instance, the Huffman optimal code or some nearly optimal code (say, the
Shannon-Fano code, or any code with word lengths $l_i$ such that
$-\log p_i/\log m \le l_i < -\log p_i/\log m + 1$). In other words, the prescription here is to
make use of a code for which the redundancy in the encoded message is the least
possible or at least sufficiently close to it.
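The quantities of this paragraph are easy to evaluate numerically. The following is a minimal sketch under stated assumptions: the letter probabilities `probs` and the channel parameters `m` and `L` are hypothetical values chosen for illustration, and logarithms are taken to base 2, so that information is measured in bits.

```python
import math

def entropy(probs):
    """Entropy H of a single letter, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def code_word_lengths(probs, m):
    """Smallest integers l_i with -log p_i / log m <= l_i < -log p_i / log m + 1."""
    # The tiny offset guards against floating-point values such as 3.0000000000000004.
    return [math.ceil(-math.log(p) / math.log(m) - 1e-12) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical letter probabilities
m, L = 2, 100                      # hypothetical channel: 2 signals, 100 per unit time

H = entropy(probs)                 # bits per letter
C = L * math.log2(m)               # bits per unit time
v = C / H                          # letters per unit time
lengths = code_word_lengths(probs, m)
kraft_sum = sum(m ** -l for l in lengths)  # must not exceed 1 for a prefix code to exist

print(H, C, v, lengths, kraft_sum)
```

For these particular probabilities the word lengths coincide with $-\log p_i/\log m$ exactly, so the average word length equals the entropy and no redundancy remains; for other distributions the lengths are rounded up and a small redundancy appears.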
In the case of a noisy channel, the situation is rather different. It is natural
that in this case only the redundancy in the sequence of transmitted signals
enables us to reproduce the transmitted information correctly on the
basis of the output data. It is well known that when the noise is appreciable we
even strive to increase the message redundancy artificially, say by repeating
every transmitted word a few times or by replacing every letter of the
message by a whole word starting with that letter (transmission ‘through
letters’). Hence it is clear that the use of a code leading to the least redundancy
in the encoded message is not appropriate here and the transmission rate has to
be lowered. But this naturally raises the question: to what extent?

To answer this question, it is necessary to start with the examination of a
mathematical model of a noisy communication channel. We first assume that
this channel uses $m$ distinct elementary signals $A_1, A_2, \ldots, A_m$, but due to noise
the transmitted signal $A_i$ (where $i = 1, 2, \ldots,$ or $m$) can sometimes be
interpreted at the receiving end of the channel as some other signal $A_j$ (different
from $A_i$). For a quantitative description of this situation, we assign the probability
$p_{A_1}(A_1)$ to the event of obtaining correctly the signal $A_1$ at the output if $A_1$ is
transmitted (so that $p_{A_1}(A_1)$ is the probability of transmission of $A_1$ without error)
and the probabilities $p_{A_1}(A_2), p_{A_1}(A_3), \ldots, p_{A_1}(A_m)$ to the events that the
transmitted signal $A_1$ is interpreted at the output as $A_2, A_3, \ldots, A_m$. It is further
required to assign the probabilities $p_{A_2}(A_1), p_{A_2}(A_2), \ldots, p_{A_2}(A_m)$ to obtaining
the signals $A_1, A_2, \ldots, A_m$ at the channel output when the signal $A_2$ is actually
transmitted, and so on up to the probabilities $p_{A_m}(A_1), p_{A_m}(A_2), \ldots, p_{A_m}(A_m)$ of
obtaining the signals $A_1, A_2, \ldots, A_m$ at the output if the signal $A_m$ is actually
transmitted. The probabilities

\[
\begin{array}{llll}
p_{A_1}(A_1), & p_{A_1}(A_2), & \ldots, & p_{A_1}(A_m); \\
p_{A_2}(A_1), & p_{A_2}(A_2), & \ldots, & p_{A_2}(A_m); \\
\ldots & \ldots & \ldots & \ldots \\
p_{A_m}(A_1), & p_{A_m}(A_2), & \ldots, & p_{A_m}(A_m)
\end{array}
\]

in this case characterize the channel noise, i.e., they are
mathematical characteristics of the channel. Thus, a complete mathematical
description of a noisy channel consists of prescribing an integer $m$ (defining the
number of distinct elementary signals that can be transmitted over the channel),
a number $L$ or $\tau = 1/L$ (defining the number of signals transmitted per unit of
time or the signal duration) and also $m^2$ nonnegative numbers $p_{A_i}(A_j)$ [which
obviously must satisfy the $m$ conditions $p_{A_i}(A_1) + p_{A_i}(A_2) + \ldots + p_{A_i}(A_m) = 1$ for
every value of $i = 1, 2, \ldots, m$] characterizing the effect of noise. In this
connection, we recall that in Sections 4.1 and 4.2 various communication channels were
characterized only by the numbers $m$ and $L$.

The foregoing description of a noisy communication channel can also be
generalized by admitting that noise may sometimes so distort the transmitted
signal that at the output it cannot be identified with any of the $m$ transmitted
elementary signals $A_i$. In order to take care of such a possibility it is expedient
to assume that at the output there may be obtained not necessarily those very
$m$ elementary signals $A_1, A_2, \ldots, A_m$ which were transmitted through the
channel, but some other $r$ (where $r$ can be greater than, less than, or equal to
$m$) elementary signals $B_1, B_2, \ldots, B_r$ (all or some of which may be distinct
from $A_1, A_2, \ldots, A_m$; see Example 4° below). In this case, noise can be
characterized by the $mr$ nonnegative numbers

\[
\begin{array}{llll}
p_{A_1}(B_1), & p_{A_1}(B_2), & \ldots, & p_{A_1}(B_r); \\
p_{A_2}(B_1), & p_{A_2}(B_2), & \ldots, & p_{A_2}(B_r); \\
\ldots & \ldots & \ldots & \ldots \\
p_{A_m}(B_1), & p_{A_m}(B_2), & \ldots, & p_{A_m}(B_r),
\end{array}
\]

satisfying the $m$ conditions $p_{A_i}(B_1) + p_{A_i}(B_2) + \ldots + p_{A_i}(B_r) = 1$ for every
$i = 1, 2, \ldots, m$. Here $p_{A_i}(B_j)$ denotes the probability of obtaining the signal $B_j$ at
the output if the signal $A_i$ is actually transmitted. The communication channel
is now characterized by the integers $m$ and $r$, the number $L$ (or $\tau = 1/L$) and
the $mr$ numbers $p_{A_i}(B_j)$. The use of such general parameters for communication
channels does not at all complicate the subsequent arguments in comparison
with the case when it is assumed that $r = m$ and the signals obtained at the
receiving end coincide with the transmitted signals $A_1, A_2, \ldots, A_m$. It is,
therefore, exactly this general approach that we shall follow.†

†Generally speaking, we can generalize this parametrization slightly further by
assuming that an arbitrary (say, infinite or even continuous) set of signals $B$ can be obtained
at the output. Almost all the results indicated below carry over to this case; however, a
number of equations then become more complex. For this reason, the indicated
generalization of the notion of a communication channel will not be drawn upon in the
following.


Let us now assume that $p(A_1), p(A_2), \ldots, p(A_m)$ are, respectively, the
probabilities of the transmission of the signals $A_1, A_2, \ldots, A_m$ (here, obviously,
$p(A_1) + p(A_2) + \ldots + p(A_m) = 1$). In such a case, the experiment $\beta$ consisting
of the determination of what specific signal is transmitted has the entropy $H(\beta)$
given by

\[ H(\beta) = -p(A_1) \log p(A_1) - p(A_2) \log p(A_2) - \ldots - p(A_m) \log p(A_m). \]

The experiment $\alpha$, which consists here of finding what signal is obtained at
the output, obviously has $r$ outcomes dependent upon the outcome of $\beta$.
Moreover, the conditional probability of the outcome $B_j$ of this new experiment, given
that $\beta$ has the outcome $A_i$ (where $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, r$), is just
equal to $p_{A_i}(B_j)$. The average amount of information about experiment $\beta$
contained in the experiment $\alpha$ is

\[ I(\alpha, \beta) = H(\beta) - H_\alpha(\beta), \]

where $H_\alpha(\beta)$ is the conditional entropy defined by the equations on pp. 61-63
(with the replacement of $k$ and $l$ by $m$ and $r$ in these equations). It is clear that
the information $I(\alpha, \beta)$ never exceeds the entropy $H(\beta)$ of $\beta$, since $H(\beta)$ is equal
to the maximum information about $\beta$ that can be obtained from any experiment
(this information is contained, for example, in $\beta$ itself). The information $I(\alpha, \beta)$
equals the entropy $H(\beta)$ if and only if the outcome of $\beta$ is uniquely defined by
the outcome of $\alpha$, i.e., if and only if the received signal allows us to determine
uniquely what signal has been transmitted (from a practical viewpoint this means
that here noise does not affect the reception). The information $I(\alpha, \beta)$ is zero
when $\alpha$ does not depend on $\beta$ (i.e., when the signal received does not at all depend
on what signal is transmitted; in other words, when, because of quite strong
noise, the transmission of information in fact does not take place).
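The definition of the information $I(\alpha, \beta)$ translates directly into a short computation. The sketch below is an illustration only: the function names and the representation of the channel as a list of rows (row $i$ holding the probabilities $p_{A_i}(B_j)$) are assumptions made for the example, and logarithms are taken to base 2.

```python
import math

def entropy(probs):
    """Entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information(p_in, channel):
    """I(alpha, beta) = H(beta) - H_alpha(beta) for input probabilities p_in.

    channel[i][j] is the probability of receiving B_j when A_i is transmitted.
    """
    for row in channel:
        if not math.isclose(sum(row), 1.0, abs_tol=1e-9):
            raise ValueError("each row of conditional probabilities must sum to 1")
    m, r = len(channel), len(channel[0])
    cond_entropy = 0.0  # H_alpha(beta): average uncertainty about the input
    for j in range(r):
        q_j = sum(p_in[i] * channel[i][j] for i in range(m))  # probability of B_j
        if q_j == 0:
            continue
        # Probabilities of the inputs A_i once B_j has been received.
        posterior = [p_in[i] * channel[i][j] / q_j for i in range(m)]
        cond_entropy += q_j * entropy(posterior)
    return entropy(p_in) - cond_entropy

noiseless = [[1.0, 0.0], [0.0, 1.0]]  # the output determines the input uniquely
useless = [[0.5, 0.5], [0.5, 0.5]]    # the output does not depend on the input

print(information([0.5, 0.5], noiseless), information([0.5, 0.5], useless))
```

The two example channels illustrate the two extreme cases described above: in the first the information equals the entropy of the input, in the second it vanishes.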

We now recall that the channel capacity $C$ of a noiseless communication
channel is defined in Section 4.2 as the greatest amount of information that can be
transmitted through this channel per unit of time (see p. 173). Let us extend this
definition to the case of a noisy channel. For such a channel, the average amount
of information conveyed by one elementary signal received at the channel output
is given by

\[ I(\alpha, \beta) = H(\beta) - H_\alpha(\beta), \]

i.e., it depends on the probabilities $p(A_1), p(A_2), \ldots, p(A_m)$ that the signals
$A_1, A_2, \ldots, A_m$ are transmitted. Let

\[ c = \max I(\alpha, \beta) \]

be the maximum value of $I(\alpha, \beta)$ which can be attained by the variation of the
probabilities $p(A_1), p(A_2), \ldots, p(A_m)$, and suppose that this value is achieved
for the values $p^0(A_1), p^0(A_2), \ldots, p^0(A_m)$ of these probabilities (examples of the
explicit evaluation of the quantity $c$ and the probabilities $p^0(A_1), p^0(A_2), \ldots,
p^0(A_m)$ will be given below). The quantity $c$ is thus the largest amount of
information that can be obtained at the output when a single elementary signal
is transmitted. If it is desired to obtain the greatest amount of information
transmitted during a definite time interval (say, in a given unit of time), then it
is natural to act as follows. During the indicated time interval, we always select
the values of the transmitted elementary signals with the same probabilities
$p^0(A_1), p^0(A_2), \ldots, p^0(A_m)$, regardless of what specific signals were transmitted
previously. (For the justification of this method, see the text in small print on
pp. 297-298, where it is rigorously proved that for any transmitted sequence of
mutually dependent signals the total amount of transmitted information cannot
exceed the amount of information transmitted when the best independent signals
are used.) In such a transmission each elementary signal transmits $c$ units of
information, i.e., the amount of information conveyed per unit time is given by

\[ C = Lc = L \max I(\alpha, \beta). \]

This quantity $C$ is called the capacity of a noisy channel. Since the maximum
of $I(\alpha, \beta)$ cannot exceed $H(\beta)$ and $H(\beta)$ is never greater than $\log m$ (see pp. 48-
49), it is clear that the capacity of a noisy channel is never greater than that of
a noiseless channel through which the same number of elementary signals can
be transmitted per unit time and which uses the same number of distinct signals.
Consequently, noise can only decrease the channel capacity, which agrees well
with the inference dictated by common sense.
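For a channel with only two input signals, the maximum in the definition of $c$ can be approximated by simply trying many values of $p(A_1)$. The following sketch does this by a plain grid search, which is a numerical shortcut chosen for the illustration and not a method from the text. The function names and the example channel are hypothetical, and the information is evaluated in the equivalent form $H(\alpha) - H_\beta(\alpha)$.

```python
import math

def entropy(probs):
    """Entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information(p_in, channel):
    """I(alpha, beta) computed as H(alpha) - H_beta(alpha)."""
    m, r = len(channel), len(channel[0])
    q = [sum(p_in[i] * channel[i][j] for i in range(m)) for j in range(r)]
    noise_entropy = sum(p_in[i] * entropy(channel[i]) for i in range(m))
    return entropy(q) - noise_entropy

def capacity_two_inputs(channel, L=1.0, steps=1000):
    """Approximate C = L * max I over p(A_1) on a grid; returns (C, best p(A_1))."""
    best_c, best_p1 = 0.0, 0.0
    for k in range(steps + 1):
        p1 = k / steps
        value = information([p1, 1.0 - p1], channel)
        if value > best_c:
            best_c, best_p1 = value, p1
    return L * best_c, best_p1

# Hypothetical channel: each of two signals is received wrongly with probability 0.1.
C, p1_best = capacity_two_inputs([[0.9, 0.1], [0.1, 0.9]], L=100)
print(C, p1_best)
```

For channels with more input signals a grid over all input distributions quickly becomes expensive, and a dedicated iterative method would be preferable; the sketch is meant only to make the definition concrete.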


Examples 


1°. Let us begin with the case when $r = m$, the signals $B_1, \ldots, B_r$ coincide
with $A_1, \ldots, A_m$ and $p_{A_i}(A_j) = 1$ for $j = i$, and hence $p_{A_i}(A_j) = 0$ for $j \ne i$.
Here, we always receive the same signal as that transmitted (noise does not affect
the transmission or is even totally absent) and hence

\[ H_\alpha(\beta) = 0 \quad \text{and} \quad c = \max I(\alpha, \beta) = \max H(\beta) = \log m. \]

This maximum value is achieved, as we know, when all the possible signals to be
transmitted are equally likely, so that here $p^0(A_1) = p^0(A_2) = \ldots = p^0(A_m) =
1/m$. Thus in this case, $C = L \log m$. Hence it is seen that the definition of
the capacity of a noiseless channel derived in Section 4.2 is a particular case of
the more general definition considered here.


2°. Suppose that two elementary signals (say, the on-current $A_1$ and off-current
$A_2$) can be transmitted through a communication channel and the same two
signals $A_1$ and $A_2$ are obtained at the receiving end. Further, suppose that $p$ and
$1 - p$ are the respective probabilities of receiving any of the signals with and
without error. We have

\[ p_{A_1}(A_1) = p_{A_2}(A_2) = 1 - p, \qquad p_{A_1}(A_2) = p_{A_2}(A_1) = p, \]

so that the table of conditional probabilities given on p. 260 here has the form

\[
\begin{array}{ll}
1 - p, & p; \\
p, & 1 - p.
\end{array}
\]

The corresponding communication channel is called a binary symmetric channel;
it is schematically shown in Fig. 18, where arrow-heads indicate the received
signals into which $A_1$ and $A_2$ may be transformed, and along the lines of the
arrows are written the probabilities of the corresponding reception.


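[Fig. 18. Binary symmetric channel: each of the signals $A_1$, $A_2$ is received unchanged
with probability $1 - p$ and as the other signal with probability $p$.]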


To evaluate the quantity $c$, we use the equality

\[ I(\alpha, \beta) = H(\alpha) - H_\beta(\alpha). \]

From the table of conditional probabilities given above, it is seen that if $A_1$ is
transmitted, then we obtain at the receiving end the same signal $A_1$ with
probability $1 - p$ and the signal $A_2$ with probability $p$; if, however, $A_2$ is transmitted,
then we receive $A_1$ with probability $p$ and $A_2$ with probability $1 - p$. Hence

\[ H_{A_1}(\alpha) = H_{A_2}(\alpha) = -(1 - p) \log (1 - p) - p \log p = h(p), \]

where $h(p)$ is the function introduced on p. 49, and

\[ H_\beta(\alpha) = p(A_1) H_{A_1}(\alpha) + p(A_2) H_{A_2}(\alpha) = h(p), \]

independently of the values of the probabilities $p(A_1)$ and $p(A_2)$ [because we always
have $p(A_1) + p(A_2) = 1$]. Consequently, in this case $H_\beta(\alpha)$ does not at all depend
on $p(A_1)$ and $p(A_2)$, and for calculating

\[ c = \max I(\alpha, \beta) = \max [H(\alpha) - H_\beta(\alpha)] \]

it is required only to determine the maximum value of $H(\alpha)$. But $H(\alpha)$ is the
entropy of an experiment $\alpha$ with two possible outcomes, which can in no way
exceed 1 bit (see p. 49). On the other hand, the value $H(\alpha) = 1$ is certain to
be attained when $p(A_1) = p(A_2) = 1/2$, since in that case clearly both outcomes
of $\alpha$ have identical probabilities [in the general case, these probabilities obviously
equal $q(A_1) = p(A_1)(1 - p) + p(A_2)p$ and $q(A_2) = p(A_1)p + p(A_2)(1 - p)$].
Hence, in our case it follows that

\[ p^0(A_1) = p^0(A_2) = \frac{1}{2}, \]

\[ c = 1 + (1 - p) \log (1 - p) + p \log p = 1 - h(p), \]
and

\[ C = Lc = L[1 - h(p)]. \]

We have thus derived an explicit equation that determines the dependence of the
capacity of a binary symmetric channel on the probability $p$ of an erroneous
transmission. The graph of the function $C(p)$ is given in Fig. 19. This function
attains a maximum ($= L$) when $p = 0$ (i.e., in the absence of noise) and when
$p = 1$ (i.e., in the case of noise which transforms each transmitted signal $A_1$
into $A_2$ and vice versa; it is obvious that such noise does not hinder us from
understanding what signal has been transmitted). It is also clear that, generally, if
$p > 1/2$ then we can always replace in the received message every signal $A_1$ by $A_2$
and vice versa; this transforms the given channel into a communication channel
with error probability $1 - p < 1/2$. Hence it is obvious that the replacement of
$p$ by $1 - p$ cannot affect the value of the channel capacity $C$ (this is seen also
from the formulae obtained above), i.e., the graph of the function $C(p)$ must be
symmetric about the line $p = 1/2$. When $p = 1/2$, the channel capacity $C$ is zero; this
is related to the fact that for $p = 1/2$, regardless of what signal is transmitted, we
get at the receiving end each of the signals $A_1$ and $A_2$ with probability 1/2, so that
the received signal contains no information about what signal has been transmitted.†
When the values of $p$ range between 0 and 1/2 (or between 1/2 and 1), we have a
positive channel capacity less than $L$; moreover, this channel capacity rapidly
decreases for increasing $p$ (when $p < 1/2$) or $1 - p$ (when $p > 1/2$). Thus, for
instance, if $L = 100$, then for $p = 0.01$ (i.e., when out of 100 transmitted binary
signals on the average one is received in error) $C = 92$ bits; when $p = 0.1$ (i.e.,
if 10 out of 100 signals suffer from noise distortion) $C = 53$ bits, and when
$p = 0.25$ (i.e., if one fourth of all received signals are wrong) $C = 19$ bits.

[Fig. 19. Graph of the capacity $C(p)$ of a binary symmetric channel.]
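The numerical values quoted above follow directly from the formula $C = L[1 - h(p)]$ with logarithms to base 2. A minimal sketch (the function names `h` and `bsc_capacity` are chosen here for illustration):

```python
import math

def h(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p, L=1.0):
    """Capacity C = L * (1 - h(p)) of a binary symmetric channel."""
    return L * (1.0 - h(p))

for p in (0.01, 0.1, 0.25):
    print(p, round(bsc_capacity(p, L=100)))
```

Evaluating `bsc_capacity` at $p$ and at $1 - p$ gives the same value, in agreement with the symmetry of the graph about the line $p = 1/2$.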


†In place of a communication channel, we can here use equally successfully the result of the
throw of a coin and consider the signal $A_1$ (resp. $A_2$) to have been received when the ‘head’
(resp. ‘tail’) turns up.


3°. Let us now consider a more general example of a communication channel,
using $m$ distinct elementary signals $A_1, A_2, \ldots, A_m$, where the same signals
are also obtained at the receiving end of the channel (i.e., $r = m$, $B_i = A_i$ for
all $i$) and the probability of error-free transmission of each of these signals is
$1 - p$, but in the case of erroneous transmission the transmitted signal may with
identical probability, i.e., $p/(m - 1)$, be taken as any of the $m - 1$ signals different
from it. The table of conditional probabilities assumes here the form

\[
\begin{array}{llll}
1 - p, & \dfrac{p}{m - 1}, & \ldots, & \dfrac{p}{m - 1}; \\[2ex]
\dfrac{p}{m - 1}, & 1 - p, & \ldots, & \dfrac{p}{m - 1}; \\[2ex]
\ldots & \ldots & \ldots & \ldots \\[1ex]
\dfrac{p}{m - 1}, & \dfrac{p}{m - 1}, & \ldots, & 1 - p,
\end{array}
\]


and the corresponding communication channel is called an $m$-ary symmetric
channel. Let us again make use of the representation of $I(\alpha, \beta)$ in the form
$H(\alpha) - H_\beta(\alpha)$. Then, obviously,

\[
H_{A_1}(\alpha) = H_{A_2}(\alpha) = \ldots = H_{A_m}(\alpha)
= -(1 - p) \log (1 - p) - (m - 1) \cdot \frac{p}{m - 1} \log \frac{p}{m - 1}
\]

and, consequently,

\[ H_\beta(\alpha) = -(1 - p) \log (1 - p) - p \log \frac{p}{m - 1}. \]

Thus, as in Example 2°, it is again ascertained that $H_\beta(\alpha)$ is independent of
the probabilities $p(A_1), p(A_2), \ldots, p(A_m)$ and for finding the channel capacity
the only requirement is to determine the maximum value of $H(\alpha)$. In complete
analogy with Example 2°, this maximum value is found to be $\log m$ and is
attained when all outcomes of experiment $\alpha$ (i.e., all possible values of received
signals) are equally probable. The last condition is obviously satisfied when the
probabilities $p(A_1), p(A_2), \ldots, p(A_m)$ of sending the signals $A_1, A_2, \ldots, A_m$
also are all equal to each other. Hence

\[ p^0(A_1) = p^0(A_2) = \ldots = p^0(A_m) = \frac{1}{m}, \]

\[ c = \max I(\alpha, \beta) = \log m + p \log \frac{p}{m - 1} + (1 - p) \log (1 - p), \]

and

\[ C = L \left[ \log m + p \log \frac{p}{m - 1} + (1 - p) \log (1 - p) \right]. \]

The graph of the function $C(p)$ (for $m = 4$) is given in Fig. 20. This
function attains its maximum value $L \log m$ when $p = 0$ (in the absence of noise),
and when $p$ increases from 0 to $p = (m - 1)/m$ it decreases smoothly to zero.
The fact that the channel capacity for $p = (m - 1)/m$ is zero is quite natural:
in this case, whatever signal is sent, we obtain each of the signals $A_1, A_2, \ldots, A_m$
with equal probability $1/m$ at the receiving end, so that no information is
conveyed here regarding the signal sent. With further increase of $p$ we again
obtain a positive (though not large) channel capacity; in this case, if we receive
the signal $A_i$, we can infer that the probability of the transmission of any signal
other than $A_i$ is larger than the probability of the transmission of $A_i$, i.e., we
have nevertheless some information as to what specific signal is transmitted.
Due to this fact, the channel capacity again increases when $p$ increases from
$(m - 1)/m$ to unity, namely it becomes $L \log [m/(m - 1)]$ for $p = 1$.

[Fig. 20. Graph of the capacity $C(p)$ of an $m$-ary symmetric channel for $m = 4$;
the horizontal axis shows $p$ from 0 to 1.0.]
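The behaviour of $C(p)$ described above can be checked by evaluating the formula of this example at a few points. The sketch below is illustrative only; the function names are hypothetical, logarithms are taken to base 2, and the convention $0 \log 0 = 0$ is used at the end points.

```python
import math

def xlog2x(x):
    """x * log2(x), with the usual convention 0 * log 0 = 0."""
    return x * math.log2(x) if x > 0 else 0.0

def mary_symmetric_capacity(p, m, L=1.0):
    """C = L * [log m + p log(p/(m-1)) + (1-p) log(1-p)], in bits per unit time."""
    c = math.log2(m) + xlog2x(1.0 - p)
    if p > 0:
        c += p * math.log2(p / (m - 1))
    return L * c

m = 4
for p in (0.0, 0.25, (m - 1) / m, 0.9, 1.0):
    print(p, mary_symmetric_capacity(p, m))
```

For $m = 2$ the same function reduces to the capacity $1 - h(p)$ of the binary symmetric channel of Example 2°.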


4°. Consider again a binary communication channel through which two
signals $A_1$ and $A_2$ can be transmitted; however, it is now assumed that the signal
obtained at the output may sometimes be interpreted as one of these two signals
but occasionally may be so distorted that it is completely impossible
to identify it. In the latter case, it is appropriate to suppose that an altogether
new signal $A_3$ is received, whose appearance can be interpreted as the event that the
transmitted signal has been completely erased and cannot be deciphered (hence
such a communication channel is called a binary erasure channel). We shall
confine ourselves here to the simplest case of a binary symmetric erasure channel,
for which the probability of any of the transmitted signals $A_1$ and $A_2$ being
erased is one and the same number $q$ (i.e., $p_{A_1}(A_3) = p_{A_2}(A_3) = q$); moreover,
both the signals $A_1$ and $A_2$ are deciphered correctly at the output with one and
the same probability $1 - p - q$, and with probability $p$ they are confused (i.e.,
signal $A_1$ is taken for signal $A_2$ and vice versa). Thus, in the case of a binary
symmetric erasure channel we have $m = 2$, $r = 3$ and the corresponding table of
conditional probabilities $p_{A_i}(B_j) = p_{A_i}(A_j)$ takes the form

\[
\begin{array}{lll}
1 - p - q, & p, & q; \\
p, & 1 - p - q, & q
\end{array}
\]

(see Fig. 21).

[Fig. 21. Binary symmetric erasure channel: each of the signals $A_1$, $A_2$ is received
unchanged with probability $1 - p - q$, as the other signal with probability $p$, and as the
erasure signal $A_3$ with probability $q$.]


It is clear that, no matter what signal is transmitted, we obtain at the
receiving end the signal $A_3$ with probability $q$, whereas of the two remaining
signals one has probability $1 - p - q$ and the other probability $p$. Consequently,

$$H_{A_1}(\alpha) = H_{A_2}(\alpha) = -(1-p-q)\log(1-p-q) - p\log p - q\log q,$$

and hence

$$H_\beta(\alpha) = -(1-p-q)\log(1-p-q) - p\log p - q\log q,$$

so that

$$I(\alpha,\beta) = H(\alpha) + (1-p-q)\log(1-p-q) + p\log p + q\log q.$$


Since experiment $\alpha$ can have three outcomes $A_1$, $A_2$ and $A_3$ in the
case considered, we have $H(\alpha) \le \log 3$; hence

$$c = \max I(\alpha,\beta) \le \log 3 + (1-p-q)\log(1-p-q) + p\log p + q\log q.$$


But can the entropy of $\alpha$ be equal to $\log 3$? It is easy to see that, in
general, it cannot, whatever the probabilities $p(A_1)$ and $p(A_2)$ of the
signals $A_1$ and $A_2$. In fact, the equality $H(\alpha) = \log 3$ holds if and
only if all three outcomes of $\alpha$ are equally probable (i.e., all have
probability 1/3). In our case, however, the probability of the outcome $A_3$
(‘erasure’) is, for any choice of $p(A_1)$ and $p(A_2)$, equal to the given
number $q$, which is a characteristic of the channel and can, of course, be quite
different from 1/3. Hence, the entropy of $\alpha$ has the form

$$H(\alpha) = -q_1\log q_1 - q_2\log q_2 - q\log q,$$

where $q$ is fixed, but $q_1 = p(A_1)(1-p-q) + p(A_2)p$ and
$q_2 = p(A_1)p + p(A_2)(1-p-q)$ are the probabilities of obtaining the signals
$A_1$ and $A_2$, respectively, at the receiving end, which depend on the values
of $p(A_1)$ and $p(A_2) = 1 - p(A_1)$. It is clear that $q_1 + q_2 = 1 - q$ for
all values of $p(A_1)$ and $p(A_2)$. But it is easy to see that the maximal value
of $-q_1\log q_1 - q_2\log q_2$, where $q_1 + q_2 = 1 - q$ (and $q$ is a fixed
number satisfying the obvious condition $0 < q < 1$), is attained when
$q_1 = q_2 = (1-q)/2$.† In addition, it is easily verified that the values
$q_1 = q_2 = (1-q)/2$ are in fact possible: for this it is only necessary to
suppose that $p(A_1) = p(A_2) = 1/2$. This yields


$$c = \max I(\alpha,\beta) = -(1-q)\log\frac{1-q}{2} + (1-p-q)\log(1-p-q) + p\log p$$

$$= (1-q)\,[1 - \log(1-q)] + (1-p-q)\log(1-p-q) + p\log p,$$

the maximum being attained for $p^0(A_1) = p^0(A_2) = 1/2$, and, hence,

$$C = L\{(1-q)\,[1 - \log(1-q)] + (1-p-q)\log(1-p-q) + p\log p\}.$$
†In fact, by adding the constant term $(1-q)\log(1-q) = (q_1+q_2)\log(1-q)$ to
$-q_1\log q_1 - q_2\log q_2$ and then multiplying the sum obtained by the constant
factor $1/(1-q)$, we get the expression

$$-\frac{q_1}{1-q}\log\frac{q_1}{1-q} - \frac{q_2}{1-q}\log\frac{q_2}{1-q},$$

which represents the entropy of an experiment with two outcomes having the
probabilities $q_1/(1-q)$ and $q_2/(1-q)$. This entropy obviously takes its
largest value when $q_1 = q_2$; consequently, the largest value of the original
expression $-q_1\log q_1 - q_2\log q_2$ is also attained when $q_1 = q_2$.

The channel capacity C obtained depends on two numbers p and q, which
characterize the probabilities of the different types of errors in the given
case. It is easy to show that C decreases as either p or q increases (subject to
the natural assumption that p < 1/2). We further note that in an actual binary
erasure channel the inequality p < q is usually valid, i.e., the probability of
the transmitted signal being so distorted that it becomes impossible to identify
is usually larger than the probability of a distortion that makes it resemble
the other of the two signals used. In a number of cases, the probability p is
found to be so small that it can be completely ignored, i.e., it may be
considered that the only possible detrimental distortion of a signal due to
noise is the one which makes the signal impossible to decipher at the output
(i.e., at the output it gets ‘erased’). If we agree to consider that p = 0,
then the formula for the channel capacity C assumes the singularly simple form

$$C = L(1 - q)$$

(see Fig. 22). The preceding result is quite natural: when p = 0, out of the L
binary signals transmitted over our communication channel per unit time, on the
average Lq signals are ‘erased’, i.e., do not convey any information, whereas the
remaining L(1 - q) signals are accurately deciphered at the receiving end, so
that each of them transmits exactly 1 bit of information.
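The capacity formula for the binary symmetric erasure channel can be compared with a direct maximization of $I(\alpha,\beta)$ over the input probability $p(A_1)$. The sketch below is only an illustration under stated assumptions: the function names and the parameter values p = 0.02, q = 0.2 are hypothetical, logarithms are to base 2, and the maximization is a plain grid search with L = 1.

```python
import math

def xlog2x(x: float) -> float:
    """x * log2(x), with the convention 0 log 0 = 0."""
    return 0.0 if x == 0 else x * math.log2(x)

def erasure_capacity(p: float, q: float) -> float:
    """Closed form c = (1 - q)[1 - log(1 - q)] + (1 - p - q) log(1 - p - q) + p log p."""
    return (1 - q) - xlog2x(1 - q) + xlog2x(1 - p - q) + xlog2x(p)

def mutual_information(p1: float, p: float, q: float) -> float:
    """I(alpha, beta) when A_1 is sent with probability p1 and A_2 with 1 - p1."""
    q1 = p1 * (1 - p - q) + (1 - p1) * p      # probability of receiving A_1
    q2 = p1 * p + (1 - p1) * (1 - p - q)      # probability of receiving A_2
    h_out = -xlog2x(q1) - xlog2x(q2) - xlog2x(q)
    h_cond = -xlog2x(1 - p - q) - xlog2x(p) - xlog2x(q)
    return h_out - h_cond

p, q = 0.02, 0.2
best = max(mutual_information(k / 1000, p, q) for k in range(1001))
print(f"grid maximum of I : {best:.6f}")
print(f"closed form c     : {erasure_capacity(p, q):.6f}")
print(f"p = 0 gives 1 - q : {erasure_capacity(0.0, q):.6f}")
```

The grid contains the point $p(A_1) = 1/2$, so the grid maximum coincides with the closed form up to rounding, and setting p = 0 reproduces c = 1 - q.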

The circumstance that in all the preceding examples the channel capacity C
was attained when the probabilities of transmission of all the employed signals
were taken to be equal is obviously accidental. It is explained merely by the
fact that, for simplicity of calculation, in all these examples the table of
conditional probabilities $p_{A_i}(B_j)$ of the channel was chosen to be quite
symmetric. In order to illustrate that a different situation may also hold, we
give one more result, related to the following slightly more complex example
first suggested by Shannon [21].


5°. Suppose that three elementary signals $A_1$, $A_2$ and $A_3$ can be
transmitted over a communication channel, where the first signal differs
considerably from the other two and can always be unmistakably interpreted at
the output of the channel, but each of the other two signals has probability
$1 - p$ of being interpreted correctly and probability $p$ of being mistaken for
the other of the two. In other words, we consider that m = r = 3 and that the
table of conditional probabilities $p_{A_i}(A_j)$ has the form

$$\begin{array}{ccc} 1, & 0, & 0; \\ 0, & 1-p, & p; \\ 0, & p, & 1-p \end{array}$$

(see Fig. 23). Consequently,

$$H_{A_1}(\alpha) = 0,$$

$$H_{A_2}(\alpha) = H_{A_3}(\alpha) = -(1-p)\log(1-p) - p\log p = h(p),$$

and

$$H_\beta(\alpha) = [p(A_2) + p(A_3)]\,h(p),$$

$$I(\alpha,\beta) = -q(A_1)\log q(A_1) - q(A_2)\log q(A_2) - q(A_3)\log q(A_3) - [p(A_2) + p(A_3)]\,h(p),$$

where $q(A_1) = p(A_1)$, $q(A_2) = p(A_2)(1-p) + p(A_3)p$ and
$q(A_3) = p(A_2)p + p(A_3)(1-p)$ are the probabilities of the outcomes $A_1$,
$A_2$ and $A_3$ of $\alpha$.

Note that $H_\beta(\alpha)$ depends not on all three probabilities $p(A_1)$,
$p(A_2)$ and $p(A_3)$ but only on $p(A_2) + p(A_3) = 1 - p(A_1)$. Applying the
arguments given in the footnote on p. 269, it is easy to show that with
$p(A_1) = q(A_1)$ fixed the entropy $H(\alpha)$ (and, hence, also the information
$I(\alpha,\beta)$) is largest if the probabilities $q(A_2)$ and $q(A_3)$ (and,
consequently, also $p(A_2)$ and $p(A_3)$) are equal to each other:

$$p(A_2) = p(A_3) = q(A_2) = q(A_3) = \tfrac{1}{2}\,[1 - p(A_1)].$$

It remains only to determine for what value of $p(A_1)$ the expression

$$I(\alpha,\beta) = -p(A_1)\log p(A_1) - [1 - p(A_1)]\left[\log\frac{1 - p(A_1)}{2} + h(p)\right]$$

will be the largest, where p is a given nonnegative number not exceeding unity.
This problem is quite complicated if only methods of elementary mathematics
are used, but is easy to solve with the aid of differential calculus.† It is
found that the desired value of $p(A_1)$ is

$$p^0(A_1) = \frac{1}{1 + 2p^p(1-p)^{1-p}}.$$

Thus,

$$p^0(A_2) = p^0(A_3) = \frac{p^p(1-p)^{1-p}}{1 + 2p^p(1-p)^{1-p}}.$$

†It is known that the point x of the segment $0 \le x \le 1$ at which the function

$$y = -x\log x - (1-x)\left[\log\frac{1-x}{2} - \log a\right]$$

(where $a = p^p(1-p)^{1-p}$ and all logarithms are taken to the base 2) takes its
greatest value coincides with the point at which the derivative of this function
vanishes.


Substituting these probabilities in the expression for $I(\alpha,\beta)$ and
multiplying the result by L, the number of signals transmitted per unit time, it
is easy to find that the capacity of our communication channel is given by

$$C = L\log\left[1 + 2p^p(1-p)^{1-p}\right].$$
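The optimal input probability and the resulting capacity of Example 5° can be verified numerically. The following sketch is an illustration only: the function names, the value p = 0.1 and the grid search are assumptions introduced here, and the capacity is computed per signal (L = 1) with logarithms to base 2.

```python
import math

def h(p: float) -> float:
    """Binary entropy h(p) = -p log p - (1 - p) log(1 - p), in bits."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

def information(x: float, p: float) -> float:
    """I(alpha, beta) for p(A_1) = x and p(A_2) = p(A_3) = (1 - x) / 2."""
    out = (x, (1 - x) / 2, (1 - x) / 2)          # output probabilities q(A_i)
    h_out = -sum(t * math.log2(t) for t in out if t > 0)
    return h_out - (1 - x) * h(p)

def capacity(p: float) -> float:
    """Closed form c = log[1 + 2 p^p (1 - p)^(1 - p)]; Python evaluates 0.0 ** 0.0 as 1.0."""
    a = p ** p * (1 - p) ** (1 - p)
    return math.log2(1 + 2 * a)

p = 0.1
a = p ** p * (1 - p) ** (1 - p)
x0 = 1 / (1 + 2 * a)                             # p^0(A_1) from the text
grid_best = max(information(k / 10000, p) for k in range(10001))

print(f"p^0(A_1)          = {x0:.4f}")
print(f"I at p^0(A_1)     = {information(x0, p):.6f}")
print(f"grid maximum of I = {grid_best:.6f}")
print(f"closed form c     = {capacity(p):.6f}")
```

The value of I at $p^0(A_1)$ agrees with the closed form, and no grid point does better; for p = 0 and p = 1/2 the function `capacity` returns log 3 and 1 bit, respectively.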


The graph of the function C = C(p) is given in Fig. 24. For p = 0 this
function takes its greatest value: it is easy to show that
$p^p(1-p)^{1-p} \to 1$ as $p \to 0$ and, consequently, here
$p^0(A_1) = p^0(A_2) = p^0(A_3) = 1/3$ and $C = L\log 3$. This result is, indeed,
obvious: for p = 0 we have a simple noiseless channel using three different
elementary signals (see Example 1°). When p increases from 0 to 1/2 the channel
capacity C decreases, since in the transmission of the second or third signal a
part of the information is lost because of the presence of noise. The
probability $p^0(A_1)$ is, therefore, found to be slightly larger than 1/3 (i.e.,
it is advantageous here to transmit the first signal more frequently than the
second or third). For p = 1/2 the channel capacity takes its smallest value,
namely C = L (since $(1/2)^{1/2} \times (1/2)^{1/2} = 1/2$). To attain this
channel capacity, the first signal ought to be transmitted in half of all cases
($p^0(A_1) = 1/2$), and the second and third in the remaining half. In effect,
the signals $A_2$ and $A_3$ ought to be considered here as one common signal,
since at the output it is impossible to tell one from the other and we can only
say that one of them, and not $A_1$, was transmitted. (Hence the case p = 1/2 is
equivalent to the case of a noiseless channel using two different signals.) When
p increases further from 1/2 to 1 the value of C(p) again increases, and
C(p) = C(1 - p) (for the same reason as in Example 2°).

[Fig. 24. Graph of C = C(p).]

Another example of a communication channel for which the probabilities
$p^0(A_i)$ are not all equal can be obtained by assuming that m = r = 2 but that
the probabilities of error in the transmission of the two given signals are not
equal to each other (the case of a binary asymmetric channel). In this case,
however, all formulas are found to be considerably more involved than those in
the foregoing examples. We shall not, therefore, dwell upon them.


Let us now assume that the channel capacity C is known. In the case of a
noiseless channel, as seen in Section 4.2, the value of C yields an accurate
estimate of the greatest possible transmission rate of a message over a given
channel: whatever the coding method, this rate cannot exceed the quantity

$$v = \frac{C}{H} \text{ letters per unit time}$$

(where H is the entropy of a single letter of the message to be transmitted); a
transmission rate arbitrarily close to v can, however, always be achieved. In a
noisy channel, besides the transmission rate, we should also take into account
the transmission accuracy characterized, for example, by the probability of error
in determining every individual transmitted letter. It is easy to see that if
the transmission rate $v_1$ (in letters per unit time) exceeds the quantity
v = C/H (where C, as defined above, is the channel capacity of a noisy
communication channel!), then accurate transmission (permitting error-free
reconstruction of all letters of the transmitted message) cannot take place by
any means. (This statement is, in fact, a loose formulation of the so-called
converse to the noisy coding theorem, of which we shall speak in more detail on
p. 282 et seq.) Indeed, in an error-free transmission at a rate $v_1$, the amount
of information about the letters of a message transmitted through the channel
per unit time is equal to the total amount of uncertainty of a $v_1$-letter
‘block’, i.e., to the product $v_1 H$ (recall that the individual letters are
assumed to be independent). Consequently, the amount of information transmitted
per unit time about the code words at the channel input (i.e., about the
outcomes of experiment $\beta$) certainly cannot be less than $v_1 H$ (see
p. 89). But since $v_1 H > C$ when $v_1 > v = C/H$, it follows from the very
definition of the quantity C that an error-free transmission of a message at a
rate $v_1 > v$ letters per unit time cannot be accomplished. Starting from these
arguments, we can also evaluate precisely the lower bound of the error
probability attainable in the ‘best possible’ transmission of a message at a
given rate $v_1 > v$ (cf. p. 282 et seq.).

It may be remarked further that if no restriction at all is imposed on the
transmission rate of the message, then in a majority of cases the probability of
error in the determination of each transmitted letter can easily be made as
small as desired. For this it usually suffices that every transmitted signal (or
group of such signals) be repeated many times. It could, however, be expected
that in order to obtain a quite low error probability it would be necessary to
lower the transmission rate substantially (such a sharp fall in transmission
rate will occur, in particular, if the probability of error is decreased by means
of the multiple repetition of all the signals). Indeed, it may at first sight
seem natural to think that every decrease in the error probability in the
determination of the transmitted letters must inevitably be accompanied by a
decrease in the transmission rate, and that an indefinite decrease in error
probability can by no means be attained without lowering the transmission rate
indefinitely. It is, however, found that in reality this is not so. It has
indeed been demonstrated by Shannon that for every noisy communication channel
we can always choose a particular code allowing us to transmit a message at a
given rate arbitrarily close to

$$v = \frac{C}{H} \text{ letters per unit time}$$

(but nevertheless necessarily slightly less than this quantity!) and such that
the probability of erroneous decoding of each transmitted letter is less than
any preassigned number $\varepsilon$ (say, less than 0.001, or 0.0001, or
0.000001). Obviously, the code of which we speak here will depend on
$\varepsilon$ and, as a rule, the smaller $\varepsilon$ is, the more complex the
code will be. The assertion set above in italics generalizes the fundamental
coding theorem formulated in Section 4.2 and may be called the fundamental noisy
coding theorem. A vital role in the proof of this theorem is played by the
direct use of quite lengthy ‘block’ codes of a large number of letters; hence,
the transmission of a message at a rate close to v and with a quite small error
probability is associated with considerable delay in deciphering each
transmitted letter.
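The trade-off mentioned above for simple repetition can be made concrete. The following sketch is an illustration under stated assumptions: it takes a binary symmetric channel with error probability p = 0.1 (a hypothetical value), repeats each binary signal n times and decodes by majority vote, and compares the resulting rate 1/n with the capacity per signal obtained from the m = 2 case of the symmetric-channel formula. The function names are introduced here for illustration.

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

def repetition_error(p: float, n: int) -> float:
    """Probability that a majority vote over n repetitions (n odd) is wrong,
    when each copy is flipped independently with probability p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

p = 0.1
c = 1 - h(p)  # capacity per signal of the binary symmetric channel, in bits
for n in (1, 3, 5, 11, 21):
    print(f"n = {n:2d}   rate = {1 / n:.3f} bit/signal   "
          f"error = {repetition_error(p, n):.2e}")
print(f"capacity c = {c:.3f} bit/signal")
```

With repetition the error probability falls only as the rate 1/n tends to zero, whereas Shannon's theorem asserts that suitable block codes keep the rate close to c (about 0.53 bit per signal here) while the error probability is made arbitrarily small.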

Before we proceed further, it may be remarked here that, exactly as in the
case of the fundamental noiseless coding theorem considered in Section 4.2, the
restriction that individual letters of the text be mutually independent is in
fact not essential. In what follows, we shall make almost no use of this
requirement and shall use only one particular result related to it, according to
which, if N is sufficiently large, then out of the $n^N$ different N-letter
blocks (where each letter can take n different values) only $2^{HN}$ are
‘probable’ (and have nearly the same probability). For the case in which the
text letters are mutually dependent, this result does not hold. However, as
remarked on p. 171, in this case also, subject to certain quite general
conditions, among all possible N-letter blocks, where N is sufficiently large,
it is possible to single out a comparatively small portion of nearly equally
probable N-letter blocks with a probability sum quite close to unity. The total
number of ‘probable’ blocks of N mutually dependent letters is accordingly
stated on p. 171 to be of the order of $2^{H^{(N)}} = 2^{H_\infty N}$, where
$H^{(N)}$ is the entropy of an N-letter block and
$H_\infty = \lim_{N\to\infty} H^{(N)}/N$ is the specific entropy of a single text
letter. Thus, if the text letters are dependent, then in general we need only
replace throughout in the sequel the entropy H of one letter by the specific
entropy $H_\infty$, which is smaller than H. Moreover, in the case of a
transmission rate $v_1$ exceeding $v = C/H_\infty$ letters per unit time, we can
make use of the fact that the total amount of information contained in $v_1 T$
letters of the transmitted text (where T is the transmission time), no matter
what T is, cannot be less than $v_1 T H_\infty$ bits. This implies, as can easily
be shown, that the italicized statement on p. 274 remains valid even in the case
of the transmission of a message whose letters are mutually dependent, but the
speed v = C/H letters per unit time is replaced here by $v = C/H_\infty$ letters
per unit time.

We shall now assume again for simplicity that the individual letters of the
transmitted message are mutually independent (i.e., we shall everywhere use the
customary entropy H of one letter and not the specific entropy $H_\infty$).
Unfortunately, even in this case a rigorous proof of Shannon’s fundamental noisy
coding theorem is quite tedious. Such a proof is absent in [21] which forms the
starting point for all of information theory. In fact, Shannon [21] confines him- 
self only to the exposition of some general arguments and a highly descriptive 
enunciation of the reasons for which this theorem must hold. A rigorous 
mathematical proof of this theorem was given later by Feinstein (see, for example 
[9]), whose underlying idea differs from the original reasoning of Shannon. A 
rigorous proof of this theorem, closely following the deductions briefly touched 
upon in [21] is contained in Shannon’s [186], in which it is also shown that via the 
same path we can obtain stronger results, which we propose to take up below. 
In the present text, we start with certain very simple reasonings due to Shannon 
in order to initiate the reader into the fundamental coding theorem. Later, on 
p. 290 et seq., we also describe the rigorous methods of its proof, resting on 
deeper reasonings in [21]. In addition, taking into consideration its immense
importance, we give (in small type) at the end of this section (see pp. 298-303)
one more mathematical proof of this theorem, for the particular case of a
binary symmetric channel, based on an idea similar to that followed by
Feinstein.

Suppose that $\beta$ is an experiment consisting of the choice (and subsequent
transmission through a communication channel) of one of m elementary signals
$A_1, A_2, \dots, A_m$ with probabilities $p^0(A_1), p^0(A_2), \dots, p^0(A_m)$
that correspond to the maximum amount of information $I(\alpha,\beta)$ (i.e., for
which the capacity of our channel is realized). Shannon’s theorem says that
there exists a method of encoding a message that enables us to carry out the
transmission at a rate arbitrarily close to (but slightly less than!)

$$v = L\,\frac{c}{H} \text{ letters per unit time},$$

where

$$c = H(\beta) - H_\alpha(\beta) = H(\alpha) - H_\beta(\alpha),$$

and such that the probability of erroneously decoding the received message is
small (smaller than an arbitrary preassigned small number). Since L elementary
signals can be transmitted per unit time, the requirement for attaining such a
transmission rate is that the code word of an N-letter ‘block’ contain nearly
(but a few more than) (H/c)N elementary signals. In fact, the LT elementary
signals transmitted during a large time interval T then contain roughly
LT/[(H/c)N] = vT/N code words, corresponding to a message of approximately vT
letters.

It is known (see pp. 162-170) that in fact it need not concern us whether
the code words of all $n^N = 2^{N\log n}$ distinct N-letter messages (where n is
the number of alphabet letters) have a length close to (H/c)N signals. Indeed,
only $2^{HN}$ of these N-letter messages are ‘probable’; as regards the
remaining $2^{N\log n} - 2^{HN}$ messages, the total probability of their
appearance for large N is quite small, and hence even if their code words were
appreciably longer, this would not noticeably lower the transmission rate (which
remains close to L(c/H) letters per unit time). We further note that for
achieving a high degree of accuracy in transmission it is only necessary to
ensure that the probability of erroneous decoding of the received code word
remains small for all code words of the $2^{HN}$ ‘probable’ N-letter messages
(since all the remaining N-letter messages are very rarely encountered and the
errors in their restoration make little impact).
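The statement that only a small share of all N-letter messages is ‘probable’ can be illustrated by direct counting. The sketch below makes several assumptions that are not in the text: a binary alphabet (n = 2) in which the letter 1 has probability 0.1, blocks of N = 1000 independent letters, and the working definition that a block is ‘probable’ if its number of ones lies between N(0.1 - 0.03) and N(0.1 + 0.03).

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

p_one, N, delta = 0.1, 1000, 0.03
lo, hi = round(N * (p_one - delta)), round(N * (p_one + delta))  # 70 .. 130 ones

# Number of 'probable' blocks and their total probability.
count = sum(math.comb(N, k) for k in range(lo, hi + 1))
mass = sum(math.comb(N, k) * p_one**k * (1 - p_one)**(N - k)
           for k in range(lo, hi + 1))

rate = math.log2(count) / N   # the 'probable' blocks number 2^(rate * N)
print(f"entropy per letter H        = {h(p_one):.3f} bits")
print(f"log2(count) / N             = {rate:.3f}")
print(f"log2(all blocks) / N        = {math.log2(2):.3f}")
print(f"total probability of blocks = {mass:.4f}")
```

The ‘probable’ blocks carry almost all of the probability, yet their number is about 2 raised to 0.55N rather than 2 raised to N; the exponent lies between h(0.1) and h(0.13) and approaches HN as the tolerance is narrowed for larger N.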

We seek a coding method for which the length of the code word of an N-letter
block is $(H/c_1)N = N_1$ elementary signals;† here $c_1$ is a number chosen
beforehand, which must satisfy the single condition

$$c_1 < c$$

(but $c_1$ may be arbitrarily close to c!). The number of all distinct sequences
of $(H/c_1)N$ elementary signals obviously equals
$m^{(H/c_1)N} = 2^{(\log m/c_1)HN}$. Since $c_1 < c \le H(\beta) \le \log m$, it
is certainly larger than $2^{HN}$, and hence a distinct sequence of
$N_1 = (H/c_1)N$ elementary signals can be associated as a code word with each of
the $2^{HN}$ ‘probable’ N-letter messages. However, it is further required that
the probability of erroneous decoding of all transmitted code words remain
small. This clearly requires that the $2^{HN}$ code words used by us differ
sharply from each other, for only subject to such a restriction can it be
expected that, in spite of the possible noise distortion of the transmitted
signals, it will be possible to distinguish the code words from one another at
the channel output with sufficient reliability.

†As is usual, if the number $(H/c_1)N = N_1$ is not an integer, then it is
necessary to replace it by the integer closest to it. This remark relates also
to all other numbers encountered below which by their very nature must be
integers.

In order to estimate the possible number of such $N_1$-term code words that are
effectively distinguishable from one another, we may argue as follows. Every
sequence of $N_1 = (H/c_1)N$ transmitted elementary signals $A_i$ (where
i = 1, 2, ..., m) is received at the channel output as a chain of $N_1$
elementary signals $B_j$ (where j = 1, 2, ..., r; see p. 261). Obviously, by
transmitting one and the same $N_1$-sequence
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ many times we obtain at the output many
different sequences $B_{j_1} B_{j_2} \dots B_{j_{N_1}}$. This fact reflects the
random character of the noise which affects the transmission. However, by
transmitting a single $N_1$-sequence $A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ we shall
obtain at the output different sequences $B_{j_1} B_{j_2} \dots B_{j_{N_1}}$
with different frequencies: some of them appear relatively frequently, others
quite rarely.‡ The following arguments enable us to evaluate approximately the
number of chains $B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ which, with not too low a
probability, may arise in the transmission of the given chain
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$. Assume that elementary signals $A_i$ are
successively transmitted through the communication channel, each time choosing
the transmitted signal at random (and independently of all signals transmitted
previously), with probabilities $p^0(A_1), p^0(A_2), \dots, p^0(A_m)$. In such a
case, by what is stated on p. 168, for large $N_1$, among all $N_1$-signal
chains of the form $A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ only $2^{H(\beta)N_1}$
chains are ‘probable’ (all having nearly the same probability), while the total
probability of the transmission of one of the remaining chains (whose number is
equal to $m^{N_1} - 2^{H(\beta)N_1} = 2^{N_1\log m} - 2^{H(\beta)N_1}$) is found
to be very small. We agree to choose all the $N_1$-signal code words needed by
us from the $2^{H(\beta)N_1}$ ‘probable’ $N_1$-signal chains and to ignore
completely the remaining such chains. This is possible since

$$H(\beta)N_1 = \frac{H(\beta)}{c_1}\,HN > HN$$

(because $c_1 < c \le H(\beta)$) and, consequently, the total number of
‘probable’ chains also exceeds the number $2^{HN}$ of required code words.

‡For example, let us consider the case of the binary symmetric channel (see
pp. 263-264). If the $N_1$-sequence $A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ is
transmitted through such a channel, where $N_1$ is large enough, we obtain at
the output, with a fairly large probability, one of the $N_1$-sequences differing
from the transmitted chain of signals in not fewer than $N_1(p - \delta)$ signals
but in not more than $N_1(p + \delta)$ signals, where $\delta$ is some small
number (see the discussion of the law of large numbers in Section 1.4).

Consider now all possible chains of the form
$A_{i_1} A_{i_2} \dots A_{i_{N_1}} B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ formed of
$N_1$ transmitted elementary signals $A_i$ and those $N_1$ signals $B_j$ into
which these signals $A_i$ get converted after transmission through the
communication channel. The total number of such $2N_1$-term chains is obviously
given by

$$m^{N_1} r^{N_1} = 2^{(\log m + \log r)N_1}.$$

The arguments adduced on p. 168 can be applied to these chains also, implying
the following result: if all the $A_i$ are chosen as explained above, then only
$2^{H(\alpha\beta)N_1}$ are ‘probable’ out of the total number
$2^{(\log m + \log r)N_1}$ of our chains. Moreover, the probabilities of all
probable chains are nearly equal to each other, while the total probability of
all the remaining $2^{(\log m + \log r)N_1} - 2^{H(\alpha\beta)N_1}$ chains is
quite small.† Consequently, the number of ‘probable’ $2N_1$-term chains
$A_{i_1} A_{i_2} \dots A_{i_{N_1}} B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ exceeds the
number of ‘probable’ $N_1$-term transmitted chains
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ by

$$2^{H(\alpha\beta)N_1} : 2^{H(\beta)N_1} = 2^{[H(\alpha\beta) - H(\beta)]N_1} = 2^{H_\beta(\alpha)N_1}$$

times. Hence it can be concluded that a whole group of $2^{H_\beta(\alpha)N_1}$
chains $B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ of received signals corresponds to
every ‘probable’ $N_1$-term transmitted chain
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$, which gets converted into one of the chains
of this group with a large probability (i.e., with a probability very close to
unity). For brevity, we designate this group of $2^{H_\beta(\alpha)N_1}$ chains
$B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ as the group $\mathcal{B}$ corresponding to
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ (see the schematic diagram in Fig. 25).
Combining each of the $2^{H(\beta)N_1}$ chains
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ that are sufficiently ‘probable’ to be
transmitted with the $2^{H_\beta(\alpha)N_1}$ chains of the group $\mathcal{B}$
corresponding to it, we obtain precisely all $2^{H(\alpha\beta)N_1}$ ‘probable’
chains $A_{i_1} A_{i_2} \dots A_{i_{N_1}} B_{j_1} B_{j_2} \dots B_{j_{N_1}}$.

[Fig. 25. Schematic diagram: a transmitted chain $A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ and the group $\mathcal{B}$ of received chains corresponding to it.]
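The counting argument above rests on the identity $H(\alpha\beta) - H(\beta) = H_\beta(\alpha)$, which can be checked for any concrete channel. The following sketch is an illustration only: the function names are hypothetical, and the example channel (binary symmetric with p = 0.1, equiprobable inputs, $N_1$ = 2000) is an assumption chosen for simplicity.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities (zero terms are skipped)."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

def channel_entropies(p_in, channel):
    """p_in[i] = p(A_i); channel[i][j] = probability of receiving B_j when A_i is sent.
    Returns H(beta), H(alpha), H(alpha beta) and H_beta(alpha)."""
    joint = [[pi * pij for pij in row] for pi, row in zip(p_in, channel)]
    p_out = [sum(col) for col in zip(*joint)]          # probabilities of the B_j
    h_in = entropy(p_in)
    h_out = entropy(p_out)
    h_joint = entropy([x for row in joint for x in row])
    h_cond = sum(pi * entropy(row) for pi, row in zip(p_in, channel))
    return h_in, h_out, h_joint, h_cond

p = 0.1
h_in, h_out, h_joint, h_cond = channel_entropies([0.5, 0.5],
                                                 [[1 - p, p], [p, 1 - p]])
N1 = 2000
print(f"log2 of 'probable' transmitted chains : {h_in * N1:.1f}")
print(f"log2 of 'probable' 2N1-term chains    : {h_joint * N1:.1f}")
print(f"log2 of chains in one group           : {h_cond * N1:.1f}")
print(f"c = H(alpha) - H_beta(alpha)          : {h_out - h_cond:.4f}")
```

For this channel the three exponents are 2000, about 2938 and about 938, so each ‘probable’ transmitted chain is indeed accompanied by a group of about 2 raised to $H_\beta(\alpha)N_1$ received chains.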

†The result is based on the fact that any $2N_1$-term chain
$A_{i_1} A_{i_2} \dots A_{i_{N_1}} B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ can be
considered as a chain
$(A_{i_1}B_{j_1})(A_{i_2}B_{j_2}) \dots (A_{i_{N_1}}B_{j_{N_1}})$ formed of $N_1$
successive outcomes of the compound experiment $\alpha\beta$ (with mr possible
outcomes), having the entropy $H(\alpha\beta)$.

Two $N_1$-term chains of transmitted signals
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ and $A_{i'_1} A_{i'_2} \dots A_{i'_{N_1}}$
should be considered to be ‘effectively distinguishable from each other’ if the
two groups $\mathcal{B}$ corresponding to them are disjoint. In fact, the message
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ (respectively
$A_{i'_1} A_{i'_2} \dots A_{i'_{N_1}}$) after transmission through our
communication channel is ‘almost surely’ (i.e., with probability close to unity)
transformed into one of the chains $B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ belonging
to the first (resp. second) group $\mathcal{B}$. Hence, if the indicated two
groups $\mathcal{B}$ are disjoint and it is known that either the message
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ or $A_{i'_1} A_{i'_2} \dots A_{i'_{N_1}}$ was
transmitted, then we can, for instance, in all cases when one of the chains of
the first group $\mathcal{B}$ is obtained at the output, assume that the message
$A_{i_1} A_{i_2} \dots A_{i_{N_1}}$ was transmitted, but when any other chain is
obtained (including all chains of the second group $\mathcal{B}$) consider that
$A_{i'_1} A_{i'_2} \dots A_{i'_{N_1}}$ was transmitted. Here it is clear that the
probability of erroneous decoding of the received message will be fairly small.
Analogously, if it is required to choose $2^{HN}$ different code words, each of
them consisting of $N_1$ signals $A_i$, then in order that the probability of
erroneous decoding of the received message be small, it suffices to be able to
choose these code words in such a way that all the $2^{HN}$ groups $\mathcal{B}$
corresponding to them are disjoint. Since each group $\mathcal{B}$ contains

$$2^{H_\beta(\alpha)N_1} = 2^{\frac{H_\beta(\alpha)}{c_1}HN}$$

chains $B_{j_1} B_{j_2} \dots B_{j_{N_1}}$, there are

$$2^{\frac{H_\beta(\alpha)}{c_1}HN} \times 2^{HN} = 2^{\left(\frac{H_\beta(\alpha)}{c_1} + 1\right)HN}$$

chains in the $2^{HN}$ groups $\mathcal{B}$. Moreover, since all such chains
$B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ terminate the ‘probable’ $2N_1$-term
sequences $A_{i_1} A_{i_2} \dots A_{i_{N_1}} B_{j_1} B_{j_2} \dots B_{j_{N_1}}$,
they themselves are also ‘probable’, i.e., they belong to the set of $N_1$-term
chains of signals $B_j$ that arise not too infrequently at the channel output
when $N_1$ signals $A_i$ are successively transmitted through the channel and
each time the transmitted signal is chosen at random with probabilities
$p^0(A_1), p^0(A_2), \dots, p^0(A_m)$ (regardless of what signals were
transmitted earlier). The number of such ‘probable’ chains
$B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ (i.e., the ‘probable’ chains of $N_1$
successive outcomes of experiment $\alpha$), as is known, is given by

$$2^{H(\alpha)N_1} = 2^{\frac{H(\alpha)}{c_1}HN}.$$

Let us now construct the ratio of the total number
$2^{(H(\alpha)/c_1)HN}$ of ‘probable’ chains
$B_{j_1} B_{j_2} \dots B_{j_{N_1}}$ to the total number
$2^{[(H_\beta(\alpha)/c_1) + 1]HN}$ of such chains appearing in the $2^{HN}$
groups $\mathcal{B}$:

$$\frac{2^{\frac{H(\alpha)}{c_1}HN}}{2^{\left(\frac{H_\beta(\alpha)}{c_1} + 1\right)HN}} = 2^{\left(\frac{H(\alpha) - H_\beta(\alpha)}{c_1} - 1\right)HN} = 2^{\left(\frac{c}{c_1} - 1\right)HN}.$$


It is seen that if $c_1$ is larger than c, then this ratio is less than unity,
i.e., the total number of chains in our $2^{HN}$ groups $\mathcal{B}$ is larger
than the total number of all ‘probable’ chains
$B_{j_1} B_{j_2} \dots B_{j_{N_1}}$. It is hence clear that for $c_1 > c$ it is
impossible to select in any way whatsoever the $2^{HN}$ code words such that all
groups $\mathcal{B}$ corresponding to them are disjoint. This obviously is as it
should be, since it is already known to us that, if $c_1 > c$, then it is
impossible to transmit through our channel a message at a rate of $L(c_1/H)$
letters per unit time and to achieve an arbitrarily small probability of its
erroneous decoding at the channel output. But if $c_1$ is less than c, then the
ratio set forth above is found to be greater than unity (since in this case
$(c/c_1) - 1 > 0$); furthermore, for extremely large N it is equal to 2 raised
to a large power, i.e., it is exceedingly large. Thus, for large N, the total
number of chains in the $2^{HN}$ groups $\mathcal{B}$ forms a negligible share of
the overall number of ‘probable’ chains of $N_1$ signals $B_j$. This situation
yields a highly plausible premise that $2^{HN}$ code words of length $(H/c_1)N$
can be chosen such that all groups $\mathcal{B}$ corresponding to them are
disjoint. Moreover, as we know, such a choice of code words assures for large N
the possibility of deciphering the received message with an arbitrarily small
probability of error.
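The sign change of the exponent $(c/c_1 - 1)HN$ at $c_1 = c$ can be seen in a small numerical illustration. The sketch below is not part of the original argument: it assumes a binary symmetric channel with p = 0.1 and equiprobable inputs (so that $H(\alpha) = 1$ and $H_\beta(\alpha) = h(p)$), and a hypothetical source with H = 1 bit per letter encoded in blocks of N = 1000 letters; the function names are introduced here.

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits, with 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

def log2_chain_counts(H, N, c1, h_out, h_cond):
    """Return (log2 of the number of 'probable' output chains,
               log2 of the number of chains in all 2^(HN) groups)."""
    N1 = (H / c1) * N                      # code-word length in signals
    return h_out * N1, h_cond * N1 + H * N

p = 0.1
h_out, h_cond = 1.0, h(p)                  # H(alpha) and H_beta(alpha) for this channel
c = h_out - h_cond                         # capacity per signal, about 0.531 bit
H, N = 1.0, 1000                           # hypothetical source and block length

for c1 in (0.45, 0.50, 0.60):
    probable, in_groups = log2_chain_counts(H, N, c1, h_out, h_cond)
    print(f"c1 = {c1:.2f}   log2(ratio) = {probable - in_groups:8.1f}   "
          f"(c/c1 - 1)HN = {(c / c1 - 1) * H * N:8.1f}")
```

For $c_1$ below c the ‘probable’ output chains outnumber the chains in the groups by a factor of 2 raised to a positive power that grows with N; for $c_1$ above c the exponent is negative and disjoint groups cannot exist.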

The arguments set forth above lend great plausibility to Shannon’s theorem, 
but obviously we cannot regard them as a mathematical proof of this theorem 
(this situation is more elaborately explained on pp. 290-291). In spite of this, 
for the present we confine ourselves to all that has been stated and pass on to 
analyze certain other problems connected with Shannon’s theorem. Later, however,
on pp. 291-297 we shall adduce, following Shannon, interesting (but not quite
simple) arguments which prove conclusively that there must indeed exist a choice
of $2^{HN}$ code words that guarantees, if not the complete non-overlapping of
the corresponding $2^{HN}$ groups $\mathcal{B}$, then at least that this
overlapping is sufficiently small so as not to affect the fact that the
probability of erroneous decoding can be made arbitrarily close to zero. At the
end of the present section (pp. 298-303) we shall analyze in greater depth
another rigorous proof of
the fundamental coding theorem, though related only to the special case of a 
binary symmetric channel. It is left to the reader to decide whether it is worth- 
while to devote his time to all this material (and when, whether now or later, he 
should follow the plan of exposition as given in the book), or whether he should 
prefer to confine himself to just the non-rigorous reasonings set forth above; in 
the latter case the entire concluding portion of this section (from pp. 290 to 304) 
may be skipped by the reader. 

The reader should only be cautioned in advance that both proofs of Shannon’s
theorem presented at the end of the section are noneffective, as are all other
known proofs of it: they show that for sufficiently large $N$ there necessarily
exists a choice of code words guaranteeing that the probability of error in
reconstructing each letter of the received message does not exceed a given
(arbitrarily small) number $\varepsilon$, but they say nothing about how such a
choice can be found (see also the beginning of the next section, where this
point is explained more precisely). The question of how the code words ought in
fact to be chosen so that the probability of decoding error is sufficiently
small is dealt with in the next and last section of the book.


It was noted above that Shannon’s theorem does not allow us to indicate in
what specific manner we ought to choose code words in order that a message
be transmitted through a communication channel at a given rate

$$v_1 = L\,\frac{c_1}{H} \ \text{letters per unit time}$$

and also in order that the corresponding probability of transmission error not
exceed a given small number $\varepsilon$. Let us now note that this theorem
also does not tell us how large the number $N$ of letters in coded blocks must
be for such transmission to become possible. The theorem implies only that, if
$N$ may be chosen arbitrarily large, then transmission with speed $v_1$ and
error probability not exceeding $\varepsilon$ is possible, whatever $v_1 < v$
and $\varepsilon > 0$. However, since with increasing $N$ the complexity of
deciphering a code increases considerably and a further time-lag in deciphering
is involved, it is of practical interest to be able to evaluate also the least
value of the error probability $\varepsilon$ attainable in transmission at a
given speed $v_1$ by means of a code whose code words correspond to letter
blocks consisting of not more than $N$ letters, where $N$ is some given number.
This problem has been dealt with by C. E. Shannon, A. Feinstein, P. Elias,
J. Wolfowitz, R. G. Gallager, R. L. Dobrushin, and other scientists; a detailed
exposition and proof of the results obtained by them can be found, for
instance, in the papers [181]-[183], [186] and the books [2], [8], [9], [11]
and [23], which are quite complex. Without going into details we shall simply
state here the basic fact stemming from all these investigations.

Recall that the transmission of an $N$-letter block at a rate
$v_1 = L(c_1/H)$ letters per unit time, where $c_1 < c$, is attained when we
assign to individual $N$-letter blocks code words consisting of
$N_1 = (H/c_1)N$ elementary signals. It is appropriate to use $c_1$ and $N_1$
in place of $v_1$ and $N$ when calculating the error probability corresponding
to the given values of $v_1 = L(c_1/H)$ and $N$, since $c_1$ and $N_1$ describe
more directly the process of information transmission over the communication
channel. It is found that for fixed $c_1 < c$ and $N_1$ there always exists a
method of transmission (i.e., a coding method that permits us to choose
$2^{c_1 N_1}$ code words consisting of $N_1$ elementary signals, and a decoding
method that gives the rule for deciphering the received $N_1$-term chains of
elementary signals $B_i$) for which the probability of erroneous interpretation
of every transmitted code word does not exceed the quantity

$$\varepsilon = \frac{1}{a^{N_1}},$$

where $a$ is some number greater than unity.† The number $a$ obviously depends
on $c_1$: the smaller $c_1$ (i.e., in fact, the smaller the rate $v_1$ of
information transmission through the communication channel), the larger is $a$.
It seems natural to expect that when $c_1$ (and hence also $v_1$) tends to zero
the number $a$ increases indefinitely (since by decreasing the information
transmission rate indefinitely, an arbitrarily small error probability can be
attained for any fixed $N$). Actually, however, all derivations of the
above-mentioned formula for $\varepsilon$ are rather crude at quite low
transmission speeds, and they usually indicate that $a$ tends to a finite value
as $c_1 \to 0$. When $c_1$ tends to $c$ (i.e., the transmission rate $v_1$
tends to $v$), the number $a$ tends to unity, so that $\varepsilon$ also
approaches unity with the growth of $v_1$. The value of $a$ for given $c_1$ is
different for different communication channels; a schematic diagram of the
dependence of $a$ on $c_1$ for a fixed channel is given in Fig. 26.

†The formula given here can of course be rewritten as
$\varepsilon = 1/a_1^{N}$, where $a_1 = a^{H/c_1}$ is a new number (also
greater than unity). However, $a_1$ depends also on the entropy $H$ of the
transmitted message, whereas $a$ depends only on the value of $c_1$ and the
characteristics of the communication channel used. For the reader acquainted
with natural logarithms, it is useful to keep in mind also that in the
scientific literature the formula for $\varepsilon$ is usually written in the
form $\varepsilon = e^{-EN_1}$, where $e = 2.718\ldots$ is the base of natural
logarithms and $E = \ln a$ is the natural logarithm (with base $e$) of $a$.
Since $e^{-EN_1}$ is an exponential function of $N_1$, the preceding formula
for $\varepsilon$ is frequently called the exponential bound of error
probability, or simply the exponential bound of error.
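As a small numerical illustration of the bound $\varepsilon = 1/a^{N_1}$, the
following sketch computes how long the code words must be to reach a target
error probability. The value a = 1.05 is purely hypothetical (the text gives no
numerical value of $a$ for any channel), and the function names are ours.

```python
import math

def error_bound(a: float, n1: int) -> float:
    """Upper bound eps = a**(-n1) on the error probability."""
    return a ** (-n1)

def min_block_length(a: float, eps: float) -> int:
    """Smallest N_1 with a**(-N_1) <= eps, i.e. N_1 >= ln(1/eps) / ln(a)."""
    if a <= 1.0:
        raise ValueError("the bound is meaningful only for a > 1")
    return math.ceil(math.log(1.0 / eps) / math.log(a))

a = 1.05           # hypothetical value, chosen only for illustration
E = math.log(a)    # exponent in the equivalent form eps = exp(-E * N_1)

for eps in (1e-3, 1e-6):
    n1 = min_block_length(a, eps)
    # The last two columns agree: a**(-N_1) and exp(-E * N_1) are the same bound.
    print(eps, n1, error_bound(a, n1), math.exp(-E * n1))
```

Because the bound is exponential in $N_1$, squaring the target (from $10^{-3}$
to $10^{-6}$) only about doubles the required code word length.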




Fig. 26. 


It is clear that Shannon’s noisy coding theorem follows directly from the
formula indicated for $\varepsilon$ and the fact that $a > 1$ for any
$c_1 < c$. Moreover, this formula appreciably extends Shannon’s theorem, which
says only that $\varepsilon$ can be made arbitrarily small if $N$ (or,
equivalently, $N_1$) is chosen sufficiently large (but states nothing about
precisely how $\varepsilon$ decreases with the growth of $N$). It is precisely
this that we had in view on p. 275 when we remarked that the results obtained
in [186] are sharper than the fundamental coding theorem.


We now pass on to the case of message transmission at a rate $v_1$ greater than
the limit rate $v = L(c/H)$ letters per unit time. This is in general of less
interest than the case of transmission at a rate $v_1 < v$, and the results
related to it are less spectacular than Shannon’s fundamental theorem;
nevertheless it merits our examination. We have already noted on p. 273 that
error-free information transmission cannot take place at a rate $v_1 > v$
letters per unit time; a similar statement may also be found on p. 279, where
it is indicated that if $c_1 > c$, then the $2^{HN}$ groups $\mathcal{B}$,
corresponding to the code words of all possible ‘probable’ $N$-letter blocks,
can in no way be chosen so that they are disjoint. In reality, however, the
reasoning given on pp. 273 and 279 allows us to draw only superficial
conclusions. It is of course true that error-free message transmission cannot
be accomplished at a rate exceeding $v = C/H$ letters per unit time. However,
even in the case of transmission at a rate $v_1 < v$ we cannot assert that
error-free message transmission is possible; we can state only that in this
case the probability of erroneous interpretation of every transmitted letter
can be made as small as desired (by using sufficiently long chains of
elementary signals as code words).† Hence a precise statement of the converse
to Shannon’s fundamental noisy coding theorem must not assert that for
$v_1 > v$ error-free information transmission is impossible, but rather that
for any fixed $v_1 > v$ there can be found a positive number $q_0 > 0$ (which
apparently must depend on $v_1$ and increase with increasing $v_1$) such that,
in the case of information transmission through a communication channel at a
rate $v_1$, the probability $q$ of erroneously deciphering every transmitted
letter of the message is not less than $q_0$ for any method of coding and
decoding (independently of the values of $N$ and $N_1$). A conjecture on the
validity of such a converse to the noisy coding theorem was made by Shannon
[21], and later it was rigorously proved by Fano [8]; we shall now proceed to
consider its proof following Fano.

In the first place, however, the very statement of the theorem under
consideration needs some sharpening. It is easy to see that the statement made
about the probability of erroneous decoding cannot hold for every transmitted
letter individually. Indeed, we can, for instance, stipulate that we shall
decipher all received letters as the first letter of the alphabet; the error
probability will then be zero in all cases in which the first letter is
actually transmitted. On the other hand, it is also clear that deciphering all
received letters as the first letter is inappropriate: we make no use
whatsoever of the communication channel and commit an error every time a letter
other than the first is transmitted; hence the mean error probability in this
case will be large. It is therefore most natural to understand the probability
$q$ of erroneously decoding a single transmitted letter as the mean error
probability, and hereafter we shall indeed do so.

†It may be noted in this connection that Shannon [185] also introduced the
concept of the zero-error capacity $C_0$ of the channel, defining it as the
highest rate (in bits per unit time) at which completely error-free information
transmission can be conducted over a given communication channel. The reasoning
on p. 273 shows only that, for any communication channel, $C_0$ cannot exceed
the channel capacity $C$ defined on p. 263, which seems almost obvious. In
fact, the zero-error capacity $C_0$ is usually appreciably smaller than $C$;
curiously, $C_0$ is a more complex quantity than the usual channel capacity
$C$: its value is generally considerably more difficult to evaluate, and it has
less intuitive content.

Thus, assume that the transmitted text is written by means of an $n$-letter
alphabet $a_1, a_2, \ldots, a_n$, and that the probabilities of the appearance
of the letters $a_1, a_2, \ldots, a_n$ at arbitrary (but fixed) places in this
text are respectively equal to the given numbers $p_1, p_2, \ldots, p_n$. By
$q$ we understand the mean value of the error probability, i.e., the quantity

$$q = p_1 q_1 + p_2 q_2 + \cdots + p_n q_n, \qquad (*)$$

where $q_1$ is the probability that the letter $a_1$, after transmission
through the communication channel, is erroneously understood at the output as
an alphabet letter other than $a_1$, and the quantities $q_2, \ldots, q_n$ have
a similar meaning. It is essential that we can also calculate this mean value
$q$ differently. Suppose that $p'_1, p'_2, \ldots, p'_n$ are the probabilities
of finding the letters $a_1, a_2, \ldots, a_n$ at an arbitrary (but fixed)
place of the message obtained at the channel output by deciphering a received
sequence of elementary signals $B_i$. Furthermore, denote by $q'_1$ the
probability that the letter $a_1$ is obtained at the output due to incorrect
deciphering of the received message (i.e., the corresponding place of the
transmitted message is in fact occupied by a letter other than $a_1$), and by
$q'_2, \ldots, q'_n$ the similar error probabilities related to the cases of
having obtained the letters $a_2, \ldots, a_n$. It is clear that the
probabilities $p'_1, p'_2, \ldots, p'_n$ in general do not coincide with
$p_1, p_2, \ldots, p_n$ (they depend on the probabilities
$p_1, p_2, \ldots, p_n$, and also on the coding-decoding method and the
characteristics of the communication channel). However, the mean value of the
error probability for a single received letter can be expressed also in terms
of $p'_1, p'_2, \ldots, p'_n$, namely

$$q = p'_1 q'_1 + p'_2 q'_2 + \cdots + p'_n q'_n. \qquad (**)$$

It is precisely the formula (**) that we shall use primarily hereafter.
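That (*) and (**) give the same number can be checked on a small example. The
sketch below uses an invented joint distribution for a 3-letter alphabet (the
matrix `P` and all variable names are our assumptions, not data from the text)
and evaluates the mean error probability both ways.

```python
# Hypothetical joint distribution: P[i][j] = Prob(letter i sent, letter j decoded).
# The numbers are invented for illustration and sum to 1.
P = [
    [0.45, 0.03, 0.02],
    [0.04, 0.25, 0.01],
    [0.02, 0.03, 0.15],
]
n = len(P)

p = [sum(P[i]) for i in range(n)]                              # p_i: letter i sent
p_prime = [sum(P[i][j] for i in range(n)) for j in range(n)]   # p'_j: letter j decoded

# q_i: probability that a sent letter a_i is decoded as some other letter
q = [1 - P[i][i] / p[i] for i in range(n)]
# q'_j: probability that a decoded letter a_j was in fact some other letter
q_prime = [1 - P[j][j] / p_prime[j] for j in range(n)]

q_star = sum(p[i] * q[i] for i in range(n))                      # formula (*)
q_double_star = sum(p_prime[j] * q_prime[j] for j in range(n))   # formula (**)
print(q_star, q_double_star)
```

Both sums reduce to the total probability of the off-diagonal entries of `P`,
which is why they must agree for any coding and decoding method.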

†It is not difficult to understand that the right-hand sides of both equations
(*) and (**) define the mean frequency of errors in successive deciphering of a
large number of letters of the transmitted message.

Taking up the proof of the converse to the noisy coding theorem, we start
with the simple case in which the transmitted message is written by means of a
two-letter alphabet (for convenience, we denote the alphabet letters in this
case by $a$ and $b$). Suppose that $\beta$ is an experiment consisting of
determining at the input an alphabet letter of the message transmitted through
the communication channel (not an elementary signal, as on pp. 261-262, but
exactly a letter!), and that $\alpha$ is another experiment consisting of
deciphering a letter at the channel output. Then both these experiments can
have two outcomes ($a$ and $b$), the probabilities of the two possible outcomes
of $\alpha$ being $p'_1$ and $p'_2$ (so that $p'_1 + p'_2 = 1$), and those of
$\beta$ given that $\alpha$ has the outcome $a$ (resp. $b$) being $1 - q'_1$
and $q'_1$ (resp. $q'_2$ and $1 - q'_2$). Consequently,

$$H_a(\beta) = -q'_1 \log q'_1 - (1 - q'_1)\log(1 - q'_1) = h(q'_1),$$
$$H_b(\beta) = -q'_2 \log q'_2 - (1 - q'_2)\log(1 - q'_2) = h(q'_2),$$

where $H_a(\beta)$ and $H_b(\beta)$ are the conditional entropies of $\beta$
given that $\alpha$ has the outcomes $a$ and $b$, respectively, and, as usual,

$$h(p) = -p \log p - (1 - p)\log(1 - p).$$

Using the relations $H_a(\beta) = h(q'_1)$, $H_b(\beta) = h(q'_2)$, we obtain

$$H_\alpha(\beta) = p'_1 H_a(\beta) + p'_2 H_b(\beta) = p'_1 h(q'_1) + p'_2 h(q'_2).$$

We now make use of the fact that $h(p)$ (whose graph is given in Fig. 8 on
p. 49) is a convex function in the sense explained in Appendix I on p. 347.
Hence, by Theorem 2 of Appendix I (p. 350), for any nonnegative $p'_1$ and
$p'_2$ such that $p'_1 + p'_2 = 1$, we have

$$p'_1 h(q'_1) + p'_2 h(q'_2) \le h(p'_1 q'_1 + p'_2 q'_2) = h(q),$$

where $q = p'_1 q'_1 + p'_2 q'_2$. Thus

$$H_\alpha(\beta) \le h(q), \qquad (A)$$

and

$$I(\alpha, \beta) = H(\beta) - H_\alpha(\beta) \ge H(\beta) - h(q).$$


We now recall that $I(\alpha, \beta)$ is the information contained in an
arbitrary text letter obtained at the channel output concerning the
corresponding letter of the transmitted message. Through the channel $v_1$
letters are transmitted per unit time, i.e., the amount of information
transmitted per unit time equals $v_1 I(\alpha, \beta)$ (the successive letters
of the message are considered to be mutually independent). But the amount of
information transmitted per unit time cannot exceed the channel capacity $C$ of
our channel;† hence

$$v_1 [H(\beta) - h(q)] \le C.$$

Since $C/H(\beta) = v$, the preceding inequality can be conveniently written in
the form

$$1 - \frac{h(q)}{H(\beta)} \le \frac{v}{v_1}. \qquad (B)$$

†Recall that $C$ is the maximum information about the transmitted elementary
signals that can be extracted from the elementary signals obtained per unit
time at the output. If the encoding of a sequence of letters of a message into
a sequence of elementary signals is not unique (for example, if the random
coding described below on p. 292 is used), then the passage from the letter
experiment to the experiment consisting of determining the transmitted signal
is accompanied by some loss of information; nonunique decoding will also have a
similar effect. For us here, however, the only important fact is that in every
case the information $v_1 I(\alpha, \beta)$ about the transmitted letters
contained in the received letters cannot be greater than $C$ (see p. 89).


Fig. 27.

Consider the graph of the function $g(q) = 1 - h(q)/H(\beta)$ (see Fig. 27a, b,
in which this function is depicted for the case in which $H(\beta) = 1$, i.e.,
when the outcomes $a$ and $b$ of $\beta$ are equally probable, and for the case
in which $H(\beta) < 1$). The graph shows that if $v_1 < v$, i.e., if
$v/v_1 > 1$, then inequality (B) is satisfied for all values of $q$, including
$q = 0$. If, however, $v_1 > v$, i.e., $v/v_1 < 1$, then this inequality can be
fulfilled if and only if the value of $q$ belongs to some interval of values
lying to the right of the point $q_0$, where $q_0 > 0$.
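The threshold $q_0$ can be computed numerically. The sketch below is only an
illustration under our own assumptions: logarithms are taken to base 2, the
function names are ours, and $q_0$ is found by bisection as the smallest $q$ in
$[0, 1/2]$ satisfying inequality (B), i.e. $h(q) \ge H(\beta)(1 - v/v_1)$.

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def q0_binary(H_beta: float, v: float, v1: float) -> float:
    """Smallest q in [0, 1/2] with 1 - h(q)/H_beta <= v/v1 (inequality (B))."""
    target = H_beta * (1 - v / v1)   # (B) is equivalent to h(q) >= target
    if target <= 0:
        return 0.0                   # v1 <= v: q = 0 already satisfies (B)
    lo, hi = 0.0, 0.5                # h increases from 0 to 1 on [0, 1/2]
    for _ in range(60):              # bisection on the increasing branch of h
        mid = (lo + hi) / 2
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

# Equally probable letters (H_beta = 1 bit), limit rate v = 1 in arbitrary units.
for v1 in (0.5, 1.0, 2.0, 4.0, 100.0):
    print(v1, q0_binary(1.0, 1.0, v1))
```

For rates $v_1 \le v$ the function returns 0, and above $v$ the bound grows
with $v_1$, which is the behaviour read off from the graph in the text.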

Thus, for $v_1 > v$ the mean error probability $q$ cannot be less than a
certain $q_0 > 0$, i.e., we have proved the statement designated above as the
converse to the noisy coding theorem. With the growth of $v_1$ (i.e., with
decreasing $v/v_1$) the value of $q_0$ increases; as $v_1 \to \infty$ (i.e.,
$v/v_1 \to 0$), $q_0$ obviously tends to the probability of transmission of
whichever of the letters $a$ or $b$ is transmitted less frequently. This result
is quite natural: when the transmission rate is extremely large we can transmit
almost no useful information through our channel, and hence the most reasonable
deciphering method in this case is to decipher all received letters as the
letter having the highest probability of being transmitted. But for such
deciphering the mean error probability $q$ is obviously equal to the
probability of the less frequent of the letters $a$ and $b$ (note that for the
indicated ‘deciphering’ a communication channel is not needed at all). If,
however, both text letters appear with the same probability, then for an
extremely large transmission rate, when the communication channel is of no
practical use, there is no basis at all for preferring one value of the
received letter to the other, so that deciphering can be carried out here
completely ‘at random’. The mean error probability $q$ in this case tends to
$1/2$ as $v_1 \to \infty$, since this is also the probability of erroneous
decoding ‘at random’ (and simultaneously the probability of the ‘more
infrequent’ letter). A schematic graph of the dependence of the lower bound
$q_0$ of the error probability on the transmission rate $v_1$ is given in
Fig. 28. The coincidence of the graph for $v_1 < v$ with the abscissa (i.e.,
$q_0 = 0$) obviously corresponds to Shannon’s fundamental coding theorem, which
asserts that for $v_1 < v$ the error probability can be made as small as
desired. (However, our argument, which proves only that the mean error
probability cannot be smaller than $q_0$ for a fixed value $v_1 > v$, does not
by itself yield the assertion that for $v_1 < v$ the quantity $q$ can indeed be
made as small as desired.) The positiveness of $q_0$ for all $v_1 > v$ is
precisely the content of the converse to the coding theorem.

Fig. 28.

The case in which the transmitted message is written in a language employing
an alphabet of $n$ letters $a_1, a_2, \ldots, a_n$ is not considerably more
complicated than the particular case of the two-letter alphabet analyzed above.
Here, however, in place of the completely elementary inequality (A) we have to
make use of the more general Fano inequality

$$H_\alpha(\beta) \le h(q) + q \log(n - 1), \qquad (A')$$

where $\alpha$ and $\beta$ have the same meaning as above and $q$ is again the
mean error probability.

Fano’s inequality (A') has a quite simple and intuitive meaning. In fact,
$H_\alpha(\beta)$ is the mean amount of uncertainty of the outcome of $\beta$
when the outcome of $\alpha$ is known. But the outcome of $\beta$ given the
outcome of $\alpha$ can be ascertained by means of the following two auxiliary
experiments. First we determine whether or not an error occurred in the
transmission of the corresponding letter of the message. This means that we
carry out an experiment $\gamma$ having only two outcomes (the answers ‘yes: it
occurred’ or ‘no: it did not occur’). The mean probability of the outcome of
$\gamma$ being positive (answer ‘yes’) obviously equals $q$. Making use of the
convexity of the function $h(p)$ it is, therefore, easy to infer that the mean
amount of uncertainty of the result of our first auxiliary experiment cannot
exceed $h(q)$ (see the inequality preceding (A) on p. 285 and also the similar
general derivation on p. 304). It is further clear that if no error occurred in
the transmission (i.e., if the outcome of $\gamma$ is negative), then the
results of $\gamma$ and $\alpha$ uniquely determine the outcome of $\beta$. If,
however, the outcome of $\gamma$ is positive (which happens on the average in a
portion $q$ of all cases), then the knowledge of the outcome of $\gamma$ does
not remove all uncertainty in the outcome of $\beta$, necessitating an extra
auxiliary experiment $\gamma_1$ in order to ascertain exactly which of the
$n - 1$ letters other than the one received was indeed transmitted. Experiment
$\gamma_1$ can have $n - 1$ different outcomes; hence the amount of its
uncertainty (the entropy of $\gamma_1$) cannot exceed $\log(n - 1)$. It is
clear that the total amount of uncertainty $H_\alpha(\beta)$ must be equal to
the amount of uncertainty of the first auxiliary experiment $\gamma$ added to
the amount of uncertainty of the second experiment $\gamma_1$, multiplied by
the mean frequency of the cases in which $\gamma_1$ is found to be needed. This
immediately implies Fano’s inequality (A') (for more details on this, see the
text in small print on pp. 303-304).
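Fano’s inequality can also be checked numerically. The following sketch draws
random joint distributions of the sent and decoded letters for a 4-letter
alphabet and compares $H_\alpha(\beta)$ with $h(q) + q \log(n - 1)$. The
random matrices, the base-2 logarithms and the function names are our
assumptions; a numerical check of this kind illustrates the inequality but is
no substitute for the proof.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

def fano_sides(P):
    """Return (H_alpha(beta), h(q) + q*log2(n-1)) for a joint matrix P[sent][decoded]."""
    n = len(P)
    p_prime = [sum(P[i][j] for i in range(n)) for j in range(n)]
    # Conditional entropy of the sent letter given the decoded letter.
    cond = sum(
        p_prime[j] * entropy([P[i][j] / p_prime[j] for i in range(n)])
        for j in range(n) if p_prime[j] > 0
    )
    q = 1 - sum(P[i][i] for i in range(n))   # mean error probability
    return cond, entropy([q, 1 - q]) + q * math.log2(n - 1)

random.seed(0)
n = 4
for _ in range(1000):
    w = [[random.random() for _ in range(n)] for _ in range(n)]
    total = sum(map(sum, w))
    P = [[x / total for x in row] for row in w]   # normalise to a joint distribution
    left, right = fano_sides(P)
    assert left <= right + 1e-12                  # inequality (A')
```

Equality is attained, for instance, when all $n^2$ pairs of sent and decoded
letters are equally probable, in which case both sides equal $\log n$.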
We now note that Fano’s inequality implies the inequality

$$I(\alpha, \beta) \ge H(\beta) - h(q) - q \log(n - 1).$$

Hence

$$v_1 [H(\beta) - h(q) - q \log(n - 1)] \le C,$$

where $C = vH(\beta)$, i.e.,

$$1 - \frac{h(q) + q \log(n - 1)}{H(\beta)} \le \frac{v}{v_1}. \qquad (B')$$


In the particular case in which $H(\beta) = \log n$, the function

$$g_n(q) = 1 - \frac{h(q) + q \log(n - 1)}{\log n}$$

differs from the function $C(p)$ depicted in Fig. 20 on p. 267 (for the
particular case $n = 4$) only by a constant factor; for convenience we draw a
similar graph (Fig. 29a). Schematic forms of the graph of the function
$g_n(q)$ for $H(\beta) < \log n$ (i.e., when not all alphabet letters are
equally probable) are given alongside in Fig. 29b. We see that if $v_1 < v$
(i.e., $v/v_1 > 1$), then inequality (B') holds for any $q \ge 0$; if, however,
$v_1 > v$ (i.e., $v/v_1 < 1$), then (B') is satisfied only for values of $q$
larger than some positive number $q_0$. This shows the converse to the coding
theorem to be true also in the general case of $n$-letter alphabets. The
dependence of the value of $q_0$ on the transmission rate $v_1$ here again has
the form schematically depicted in Fig. 28; the limiting value of $q_0$ as
$v_1 \to \infty$ (i.e., as $v/v_1 \to 0$) in the case in which
$H(\beta) = \log n$ is equal to $(n - 1)/n$, and it decreases with decreasing
$H(\beta)$.†
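The limiting value $(n - 1)/n$ can be checked directly from (B'). The short
derivation below is our own sketch rather than part of the original argument;
it assumes $H(\beta) = \log n$ and lets $v/v_1 \to 0$, in which case (B')
reduces to $h(q) + q \log(n - 1) \ge \log n$. The auxiliary symbol $\varphi$ is
introduced here only for this check.

```latex
\begin{align*}
\varphi(q) &= h(q) + q\log(n-1), \\
\varphi'(q) &= \log\frac{1-q}{q} + \log(n-1) = 0
  \iff q = \frac{n-1}{n}, \\
\varphi\!\left(\tfrac{n-1}{n}\right)
  &= \tfrac{n-1}{n}\log\tfrac{n}{n-1} + \tfrac{1}{n}\log n
     + \tfrac{n-1}{n}\log(n-1) = \log n .
\end{align*}
```

Since $\varphi$ attains its maximum $\log n$ only at $q = (n - 1)/n$, the
limiting inequality can hold only at that point, which gives
$q_0 = (n - 1)/n$.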


†If $v_1$ is quite large, then the communication channel becomes useless, and
hence it remains only to decode all the received letters as the letter having
the highest probability of being transmitted. In this case, the mean error
probability $q$ equals $1 - p$, where $p$ is the largest of the probabilities
of the alphabet letters. Since, however, the inequality (B') is not exact, an
estimate obtained from it of the lower bound $q_0$ of the mean error
probability will not in general coincide with the lowest actually attainable
value of $q$.

Fig. 29.

Let us note that, by what has been proved in the present section, the
fundamental noisy coding theorem and the converse to the noisy coding theorem
differ sharply both in the method of proof and in their character. It is true
that the probability of erroneous determination of a single transmitted letter
appears in the statement of both theorems. In fact, however, while considering
the fundamental coding theorem, the original message in letters was just
briefly touched upon at the start of our discussion, and then we concentrated
only on chains of $N_1$ elementary signals transmitted directly through the
communication channel. The essential fact here was just the following: if we
use code words (‘blocks’) consisting of $N_1$ elementary signals, then
transmission at a rate $v_1 = L(c_1/H)$ letters per unit time demands that
these code words correspond to $N$-letter messages, where $N = (c_1/H)N_1$,
i.e., that (in the case of $N_1$ sufficiently large) there are not less than
$2^{HN} = 2^{c_1 N_1}$ ‘probable’ code words. Thus, it was required to show
only that, if $c_1 < c$ (where $c = \max I(\alpha, \beta)$), then for
sufficiently large $N_1$ it is always possible to choose $2^{c_1 N_1}$ code
words of length $N_1$ in such a way that the probability of erroneous decoding
of a chain of $N_1$ elementary signals obtained at the channel output will be
less than an arbitrary (but preassigned) number $\varepsilon$, irrespective of
the specific code word that is being transmitted (here $\varepsilon$ is
naturally chosen very small, say, equal to 0.001, or 0.0001, or 0.000001). This
statement (related just to a communication channel and lengthy chains of
elementary signals transmitted through it, but by no means connected to the
original message in letters) forms exactly the essence of the fundamental
coding theorem. As regards the converse to the coding theorem, it is
essentially related to the letters of the message and not to chains of
elementary signals transmitted through the communication channel.

There exists another theorem which is also a converse to the fundamental
coding theorem, but is concerned only with a communication channel and
lengthy chains of elementary signals transmitted through it. This theorem says
that, if $c_1 > c$ and $N_1$ is sufficiently large, then no matter what
$2^{c_1 N_1}$ equally probable code words (i.e., chains of elementary signals)
of length $N_1$ are chosen and what method of deciphering the received
$N_1$-term chains of elementary signals is used, the mean probability of
erroneous decoding of the received chain all the same exceeds an arbitrary (but
preassigned) number $p_0 < 1$ (the number $p_0$ is naturally chosen here
sufficiently close to unity, say, equal to 0.999, or 0.9999, or 0.999999). It
is, of course, clear that the closer $p_0$ is to unity, the larger is the
required value of $N_1$. As regards the mean error probability in the statement
of the theorem, it obviously coincides with the arithmetic mean

$$\frac{p_{0,1} + p_{0,2} + \cdots + p_{0,2^{c_1 N_1}}}{2^{c_1 N_1}},$$

where $p_{0,i}$ is the probability of a decoding error when the $i$th of our
$2^{c_1 N_1}$ code words is transmitted.

The validity of the stated theorem is closely related to the discussion on
p. 279. It was shown there that for $c_1 > c$ and very large $N_1$ the total
number of $N_1$-term chains in the $2^{c_1 N_1}$ groups $\mathcal{B}$ (i.e., in
the groups of received ‘probable’ chains corresponding to the $2^{c_1 N_1}$
‘probable’ code words of length $N_1$) greatly exceeds the total number of all
‘probable’ received chains. Hence the received $N_1$-term chains in general
belong simultaneously to a vast number of different groups $\mathcal{B}$, so
that the probability of their correct decoding is quite low. These arguments
lend great credibility to our theorem, even though they cannot be a substitute
for its rigorous proof. Such a proof can be found, for instance, in [2], [11]
or [23]; it is not quite straightforward and we shall not dwell upon it here.
The theorem under consideration is called by Wolfowitz (the first to prove it
rigorously) the strong converse of the noisy coding theorem, and it is
frequently referred to by this designation in the literature on information
theory. However, the designation is not quite appropriate, since it may create
the wrong impression that the conventional converse of the noisy coding theorem
proved above follows from this new theorem (in fact, neither of the two
converse theorems given here is a consequence of the other). Hence it is
apparently more appropriate to call the theorem under consideration, following
Gallager [11], the converse of the noisy block coding theorem (i.e., of the
coding theorem which uses as code words blocks of elementary signals of a fixed
length).


We now pass on to the more accurate proof of Shannon’s fundamental noisy
coding theorem of which we spoke on p. 274 et seq. To start with, following
Zaremba [188], we give an example which shows quite clearly that from the fact
of the total number of chains $B_{i_1} B_{i_2} \ldots B_{i_{N_1}}$ in the
$2^{HN}$ groups $\mathcal{B}$ being exceedingly small in comparison with the
total number of such ‘probable’ chains, it still does not follow at all that
these groups can be chosen to be disjoint. To this end, consider the collection
of all possible chains of 10 elementary signals, each of which can take two
values. It is clear that the total number of such chains is $2^{10} = 1024$. We
further associate with each chain the group of all 10-term chains differing
from the given chain in not more than three signals. Besides the given chain,
this group obviously contains $\binom{10}{1} = 10$, $\binom{10}{2} = 45$ and
$\binom{10}{3} = 120$ chains differing from the given chain in exactly one, two
and three signals, respectively; therefore, the whole group consists of
$1 + 10 + 45 + 120 = 176$ chains. Since 176 is very close to $1/6$ of 1024,
it might be thought that three chains could be chosen here without any
particular difficulty such that the three groups of 176 chains corresponding to
them would be disjoint. But this is not correct: it can be shown that the
groups corresponding to any three chains necessarily intersect.

Indeed, let us denote the two values of our signals by the digits 0 and 1 and
let, for instance, the first group correspond to the ‘zero chain’ of ten zeros.
It is easy to understand that only the groups corresponding to 10-term chains
containing more than six 1’s do not intersect with the first group. But in any
two 10-term chains each containing seven or more 1’s, not less than four of
these 1’s occupy the same places in both chains. Consequently, the two given
chains differ from each other in not more than six places, and hence the groups
corresponding to them intersect with each other. Obviously, nothing is altered
if we start with any other chain (and not with the ‘zero chain’ 0000000000):
any two groups of 176 chains not intersecting with one and the same third group
necessarily intersect with each other.
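The argument above can be confirmed by exhaustive search. The sketch below
represents each 10-term chain as an integer bit mask (a representation chosen
by us), builds the group of every chain, and looks for two mutually disjoint
groups among those that do not intersect the group of the zero chain. Fixing
the zero chain as the first center loses no generality, for the reason given in
the text.

```python
from itertools import combinations
from math import comb

N_SIGNALS, RADIUS = 10, 3

def distance(x: int, y: int) -> int:
    """Number of places in which two chains (stored as bit masks) differ."""
    return bin(x ^ y).count("1")

def group(center: int) -> frozenset:
    """All chains differing from `center` in at most RADIUS places."""
    return frozenset(w for w in range(2 ** N_SIGNALS)
                     if distance(center, w) <= RADIUS)

groups = {c: group(c) for c in range(2 ** N_SIGNALS)}
group_size = len(groups[0])
assert group_size == sum(comb(N_SIGNALS, j) for j in range(RADIUS + 1))

# Centers whose group does not intersect the group of the zero chain.
far_from_zero = [c for c in groups if groups[c].isdisjoint(groups[0])]

# Is there a pair among them whose groups are also disjoint from each other?
disjoint_pair_exists = any(
    groups[x].isdisjoint(groups[y]) for x, y in combinations(far_from_zero, 2)
)
print(group_size, len(far_from_zero), disjoint_pair_exists)
```

The search finds no such pair, so no three groups of 176 chains are pairwise
disjoint, although three groups together contain barely half of the 1024
chains.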

In exactly the same way it can also be shown that, for any $k$, among all
groups of $(3k + 1)$-term chains differing from some one such chain in not more
than $k$ signals, it is impossible to find more than two disjoint groups.
Meanwhile, it can be easily verified that the ratio of the number of chains in
such a group, equal to the sum

$$1 + \binom{3k+1}{1} + \binom{3k+1}{2} + \cdots + \binom{3k+1}{k},$$

to the total number of all possible $(3k + 1)$-term chains ($= 2^{3k+1}$) will
always decrease with increasing $k$. Thus, for $k = 8$, $3k + 1 = 25$ this
ratio is close to $1/20$, and if $k$ is chosen sufficiently large, the
indicated ratio can be made as small as desired (smaller than any preassigned
small number). Thus, the total number of chains in three groups comprises an
insignificant part of the number of all possible chains, but nevertheless any
three groups necessarily intersect. Hence, in the case of Shannon’s theorem
also, it is impossible to justify the possibility of choosing $2^{HN}$ disjoint
groups by the fact that the total number of chains in them is very small in
comparison with the number of all ‘probable’ chains. It has also to be proved
rigorously that in the given case the situation is not like that in the example
due to Zaremba.
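The ratio in question is easy to tabulate. The following sketch (the function
name `group_share` is ours) evaluates it exactly with integer binomial
coefficients for several values of $k$.

```python
from math import comb

def group_share(k: int) -> float:
    """Share of all (3k+1)-term binary chains lying within k places of one chain."""
    n = 3 * k + 1
    return sum(comb(n, j) for j in range(k + 1)) / 2 ** n

for k in (1, 2, 3, 8, 20, 50):
    print(k, 3 * k + 1, group_share(k))
```

For $k = 3$ the function reproduces the share $176/1024$ of the 10-signal
example, and for $k = 8$ it gives a value between 0.05 and 0.06, in line with
the figure "close to 1/20" quoted in the text.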

In fact, no one has so far succeeded in proving rigorously that $2^{HN}$ chains
$A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ can be chosen in such a way that any two of the $2^{HN}$ groups
$\mathfrak{B}$ corresponding to them are disjoint. However, it can be shown that there
certainly exists a choice of these chains such that the corresponding groups $\mathfrak{B}$
are almost disjoint and hence their overlapping can be ignored. This fact can
be made clear by means of the following arguments, mainly due to Shannon [21].

To start with, we choose the requisite $2^{HN}$ chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ by a method
which may seem, at first sight, to be clearly unreasonable, namely ‘at
random.’ This choice ‘at random’ can be accomplished thus: we number all
$2^{H(\beta)N_1}$ ‘probable’ chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ in an arbitrary order, write out their
numbers on $2^{H(\beta)N_1}$ pieces of paper, place all these papers in an urn and mix
them well, and then draw the pieces of paper one at a time from the urn $2^{HN}$
times, replacing the paper drawn after each draw and again mixing the contents
of the urn. The chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ with the numbers drawn we take as our
$2^{HN}$ code words (such a method of choosing code words is called random coding).
It is clear that under random coding the same number may be drawn two or
more times, so that some of the $2^{HN}$ selected chains turn out to be identical to
each other and obviously they cannot be distinguished by any means at the
receiving end; this situation alone gives an impression that the suggested method
of choosing code words is undoubtedly irrational. In fact, however, for large
N the probability of such coincidence is negligibly small (since the number
$2^{H(\beta)N_1} = 2^{[H(\beta)/c_1]HN}$ of different ‘probable’ chains, when N is large, is many
times larger than the number $2^{HN}$). As we shall see later, this allows us to ignore
completely the possibility of coincidence.
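
The size of this coincidence probability can be illustrated numerically. The following is only a sketch under assumed values: the exponents HN = 20 and [H(β)/c_1]HN = 40 are hypothetical, and the function names are ours. It computes the probability that one fixed code word is drawn again among the remaining draws, and compares it with the simple estimate (K - 1)/M, where K is the number of code words and M the number of papers in the urn.

```python
import math


def duplicate_probability(num_code_words: int, num_probable_chains: int) -> float:
    """Probability that a fixed code word is repeated by at least one other draw.

    The other K - 1 draws are independent and uniform over the M papers, so the
    chance that none of them repeats the fixed word is (1 - 1/M) ** (K - 1).
    """
    k, m = num_code_words, num_probable_chains
    # log1p / expm1 keep precision when 1/M is tiny.
    return -math.expm1((k - 1) * math.log1p(-1.0 / m))


def union_bound(num_code_words: int, num_probable_chains: int) -> float:
    """Sum of the K - 1 individual coincidence probabilities 1/M."""
    return (num_code_words - 1) / num_probable_chains


# Hypothetical sizes: 2**20 code words drawn from 2**40 'probable' chains.
K = 2 ** 20
M = 2 ** 40
exact = duplicate_probability(K, M)
bound = union_bound(K, M)
print(f"exact = {exact:.3e}, bound = {bound:.3e}")
```

With these assumed sizes both numbers are below one in a million, which is the sense in which the coincidence of a given code word with another one may be ignored.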

We now assume that the signals $A_{i_1}, A_{i_2}, \ldots, A_{i_{N_1}}$, the collection of which
forms precisely one of the code words chosen by us, are transmitted successively
through our communication channel. Because of the presence of noise,
these signals are in general somewhat distorted during transmission; as a result,
we obtain at the receiving end of the channel a sequence of signals
$B_{j_1}, B_{j_2}, \ldots, B_{j_{N_1}}$ different from $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$. It is clear that the chain
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ belongs with probability quite close to unity to the group
$\mathfrak{B}$ corresponding to the chain $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$. But this chain
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ will at the same time belong also to the groups $\mathfrak{B}$
corresponding to many other chains of $N_1$ transmitted signals. This specific
circumstance makes it difficult to decipher the received message.

It is rather easy to estimate the total number of different ‘probable’ chains
$A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ having the property that the groups $\mathfrak{B}$ corresponding to them
contain the given chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$. In fact, the total number of ‘probable’
$2N_1$-term chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$, as we know, is $2^{H(\alpha\beta)N_1}$, and
the chains $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ appearing in them all belong to the set of $2^{H(\alpha)N_1}$
equally likely ‘probable’ received chains. Thus, the number of ‘probable’
$2N_1$-term chains exceeds the number of ‘probable’ $N_1$-term chains
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ by a factor of
$2^{H(\alpha\beta)N_1} : 2^{H(\alpha)N_1} = 2^{H_\alpha(\beta)N_1}$. It can hence be concluded that all
possible ‘probable’ $2N_1$-term chains are obtained by combining each of the $2^{H(\alpha)N_1}$
‘probable’ chains $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ of the received signals with $2^{H_\alpha(\beta)N_1}$ different
chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ of transmitted signals. It is precisely these $2^{H_\alpha(\beta)N_1}$
transmitted chains that possess the property that the given chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$
enters the groups $\mathfrak{B}$ corresponding to them. The collection of all these chains
$A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ we call the group $\mathfrak{A}$ corresponding to the chain
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ (see the schematic Figure 30, in which the arrows from the chains
of group $\mathfrak{A}$ to the chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ indicate that all groups $\mathfrak{B}$ of
these transmitted chains contain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ and that, consequently, there is
a real probability that any chain of group $\mathfrak{A}$ is received at the channel output
as the chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$).


Fig. 30. The group $\mathfrak{A}$ of transmitted chains, with arrows leading to the received chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$.


The group $\mathfrak{A}$ (consisting of $2^{H_\alpha(\beta)N_1}$ chains of transmitted signals
corresponding to the fixed chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ received at the channel output) plays a
central role in the method we shall use to decode the received message
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$. If the indicated group $\mathfrak{A}$ contains only one of our code words,
then we shall assume exactly this code word to have been transmitted. However,
in the case in which $\mathfrak{A}$ contains more than one code word, or contains no code
word, or finally the received $N_1$-term chain does not belong at all to the
collection of $2^{H(\alpha)N_1}$ ‘probable’ chains $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$, we shall assume some one code
word chosen arbitrarily from the existing code words to have been transmitted
(say, the code word with number 1 to have been transmitted in all these cases;
it will be seen later that this specific agreement is in fact of no consequence).

Now we have already chosen the coding method (i.e., finding the $2^{HN}$ code
words needed by us) and the decoding method (i.e., deciphering the received
message). Hence, we can proceed to determine the probability of decoding error.
Here, however, we are immediately confronted with one difficulty. Suppose the
code word $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ to have been transmitted and the message
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ to have been received at the output. Let us now denote by P the probability
that by using the decoding method described above we arrive at a wrong
conclusion, i.e., conclude that some code word other than $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ was
transmitted. It is clear that the quantity P, in principle, can be different for
different code words $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$; thus, for instance, our decoding method
explicitly places the code word with number 1 in an exclusive setting. Is it
necessary because of this that the quantity P be calculated separately for different
code words (or separately only for the first and all remaining such code words)?
We shall see below that the answer is negative, because we shall use estimates
that remain valid for all code words without exception. But, besides, our
decoding method depends also on the choice of code words to be used and this
choice, as we know, is determined by the outcome of an experiment consisting
of $2^{HN}$ draws of a paper from the urn, i.e., depends on certain random events.
Hence P is also a random variable in the sense explained on p. 5. Such a
variable can have many different values; we shall calculate below just the mean
value of P.

We know that if the number $N_1 = (H/c_1)N$ is sufficiently large, then the
message $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ is transformed into one of the chains $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$
of the group $\mathfrak{B}$ corresponding to this message with probability arbitrarily close
to unity. Furthermore, we assume $N_1$ to be so large that the indicated
probability is not less than $1 - (\varepsilon/4)$, where $\varepsilon$ is a preassigned small number. Suppose
now that $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ is the ‘probable’ chain of received signals which
belongs to the group $\mathfrak{B}$ corresponding to some code word $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$. Denote
by Q the probability that the chain referred to also belongs simultaneously to
a group $\mathfrak{B}$ that corresponds to at least one more code word (i.e., the probability
that the group $\mathfrak{A}$ corresponding to our chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ contains, in addition
to $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$, at least one more code word). It is clear that both Q and P
are random variables. Further, it is obvious that the received message
$B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ will certainly be decoded correctly if the following two conditions are
satisfied:

(A) This message belongs to the group $\mathfrak{B}$ corresponding to the transmitted
code word.

(B) Except for the above-mentioned group it does not belong to any of the
groups $\mathfrak{B}$ corresponding to other code words used.


Hence an erroneous decoding can take place only if either condition (A) or
(B) is not satisfied. But we know that the probability of the sum $\bar{A} + \bar{B}$ of two
events $\bar{A}$ and $\bar{B}$ (meaning respectively that the events A and B do not take place)
does not exceed the sum of the probabilities of $\bar{A}$ and $\bar{B}$ (see pp. 9-10).
Consequently, the probability of erroneously decoding the received $N_1$-term chain
must satisfy the inequality

$$P \le \frac{\varepsilon}{4} + Q;$$

here $\varepsilon/4$ is greater than or equal to the probability that condition (A) is not satisfied
(i.e., the event $\bar{A}$ takes place) and Q is equal to the probability of (B) not being
fulfilled (i.e., the probability of $\bar{B}$). In this inequality $\varepsilon/4$ is a fixed number,
but P and Q are random variables; for estimating the mean value of P it is,
therefore, necessary only to estimate the mean value of Q.

Besides the code word $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$, there exist also $2^{HN} - 1$ other code
words. We renumber afresh these $2^{HN} - 1$ words in an arbitrary order and
denote by $a_i$ the random event such that the group $\mathfrak{A}$ corresponding to the
chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ contains the ith code word. Condition (B) will not be
satisfied iff at least one of the events $a_1, a_2, \ldots, a_{2^{HN}-1}$ occurs; in other words,
the event $\bar{B}$ equals the sum of events $a_1 + a_2 + \ldots + a_{2^{HN}-1}$. But the
probability of the sum of events cannot exceed the sum of the probabilities of these
events (see pp. 9-10), hence

$$Q \le q_1 + q_2 + \ldots + q_{2^{HN}-1},$$

where $q_i$ is the probability of $a_i$.

Let us now try to determine the mean value of the probability $q_i$. Since the ith
code word is chosen at random (as are all the remaining code words), it can
coincide with the same probability $2^{-H(\beta)N_1}$ with each of the $2^{H(\beta)N_1}$
existing ‘probable’ chains of $N_1$ transmitted signals. In those $2^{H_\alpha(\beta)N_1}$ cases
in which the ith code word is found to coincide with one of the $2^{H_\alpha(\beta)N_1}$ chains
belonging to the group $\mathfrak{A}$ that corresponds to the chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$, the
event $a_i$ takes place, i.e., its probability is unity; in the remaining
$2^{H(\beta)N_1} - 2^{H_\alpha(\beta)N_1}$ cases this event does not take place, i.e., its probability is zero. Thus,
$q_i = 1$ for $2^{H_\alpha(\beta)N_1}$ equally likely outcomes of an experiment consisting of the
draw of one of the $2^{H(\beta)N_1}$ papers from the urn and $q_i = 0$ for the remaining
$2^{H(\beta)N_1} - 2^{H_\alpha(\beta)N_1}$ equally likely outcomes. Hence, it is clear that

$$\text{m.v. } q_i = \frac{2^{H_\alpha(\beta)N_1}}{2^{H(\beta)N_1}} \times 1 + \frac{2^{H(\beta)N_1} - 2^{H_\alpha(\beta)N_1}}{2^{H(\beta)N_1}} \times 0 = \frac{2^{H_\alpha(\beta)N_1}}{2^{H(\beta)N_1}} = 2^{[H_\alpha(\beta) - H(\beta)]N_1}.$$


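But the mean value of all variables $q_i$ is the same (because all numbers i are
equivalent), and Q does not exceed the sum of the $2^{HN} - 1$ variables $q_i$; hence
the mean value of Q is not greater than

$$(2^{HN} - 1) \times 2^{[H_\alpha(\beta) - H(\beta)]N_1} < 2^{HN} \times 2^{-\frac{H(\beta) - H_\alpha(\beta)}{c_1}HN} = 2^{-\left(\frac{c}{c_1} - 1\right)HN}$$

(here we have used $N_1 = (H/c_1)N$ and the fact that $H(\beta) - H_\alpha(\beta) = I(\alpha, \beta)$ equals the
capacity c for the input probabilities in question).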


We now recall that $c_1 < c$. This implies that for large N the expression
appearing on the right-hand side of the preceding inequality is represented by
the number 2 raised to a negative power extremely large in absolute magnitude,
i.e., it is quite small. In particular, no matter how small the chosen number $\varepsilon$,
N can be taken so large that this expression (and hence also the mean value of
Q) is less than $\varepsilon/4$.
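
How large N must be can be made concrete. The sketch below assumes that the bound on the mean value of Q has the form 2 raised to the power -(c/c_1 - 1)HN, as in the estimate above; the numerical values of H, c, c_1 and ε are purely illustrative and the function names are ours.

```python
import math


def mean_q_bound(H: float, c: float, c1: float, N: int) -> float:
    """Upper bound 2 ** (-(c/c1 - 1) * H * N) on the mean value of Q."""
    return 2.0 ** (-(c / c1 - 1.0) * H * N)


def smallest_block_length(H: float, c: float, c1: float, eps: float) -> int:
    """Smallest N for which the bound on the mean value of Q is below eps / 4."""
    if not 0.0 < c1 < c:
        raise ValueError("the argument requires 0 < c1 < c")
    rate = (c / c1 - 1.0) * H
    n = max(1, math.ceil(math.log2(4.0 / eps) / rate))
    # Guard against rounding in the closed-form guess, in both directions.
    while mean_q_bound(H, c, c1, n) >= eps / 4:
        n += 1
    while n > 1 and mean_q_bound(H, c, c1, n - 1) < eps / 4:
        n -= 1
    return n


# Illustrative values only: H = 1 bit per letter, c = 0.5, c1 = 0.4, eps = 0.01.
n_min = smallest_block_length(1.0, 0.5, 0.4, 0.01)
print(n_min, mean_q_bound(1.0, 0.5, 0.4, n_min))
```

The closer $c_1$ is taken to c, the smaller the exponent c/c_1 - 1 becomes and the longer the messages must be; this is the price paid for a transmission rate close to the capacity.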

But we know that $P \le (\varepsilon/4) + Q$; hence

$$\text{m.v. } P \le \text{m.v. } Q + \frac{\varepsilon}{4}.$$

But

$$\text{m.v. } Q < \frac{\varepsilon}{4}$$

for sufficiently large N. Hence, choosing N sufficiently large, it can always be
assured that the mean value of the probability P of erroneous decoding of any
of the $2^{HN}$ code words (which correspond to $2^{HN}$ ‘probable’ N-letter messages)
is less than $\varepsilon/2$, where $\varepsilon$ is any preassigned (no matter how small!) positive
number.

The result obtained facilitates the proof of Shannon’s fundamental noisy
coding theorem. For this we make use of the fact that the mean value of any
random variable cannot be less than all its possible values (see pp. 6-7). In
application to our case this means that among the $\left(2^{H(\beta)N_1}\right)^{2^{HN}}$ different possible
choices of our $2^{HN}$ code words (i.e., among all different outcomes of the
experiment that consists of $2^{HN}$ successive draws of papers from the urn containing
$2^{H(\beta)N_1}$ papers) there is at least one for which the value of P is found to be less
than $\varepsilon/2$.

The last assertion is quite close to the one we desire to prove but it is still
inadequate for our purpose. The point is that P is the probability that some
fixed transmitted code word $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ will be decoded erroneously at the
channel output. It is, however, required to show that there exists some choice
of the code words for which the probability of a decoding error when any of
them is transmitted through a communication channel is less than $\varepsilon$. Denote
now by $P_i$ the probability of erroneous decoding of the transmitted ith code
word. Then $P_1, P_2, \ldots, P_{2^{HN}}$ are random variables, and the mean value of
each of them can be estimated in exactly the same way as the mean value of a
fixed one of them (denoted by P in the above discussion). Hence, the mean
values of all variables $P_i$ are less than $\varepsilon/2$; but this still does not imply that for at
least one of the random choices of $2^{HN}$ code words the values of all variables
$P_1, P_2, \ldots, P_{2^{HN}}$ will be simultaneously less than $\varepsilon/2$.

The preceding difficulty can, however, be circumvented by the following
ingenious method. We choose at random not $2^{HN}$ chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$, but
two times their number, i.e., $2 \times 2^{HN}$ chains. We take these $2 \times 2^{HN}$ chains
$A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ as $2 \times 2^{HN}$ code words and transmit them all through our
communication channel, deciphering the received message $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ in
exactly the same way as described above. Since $2 \times 2^{HN} = 2^{HN+1} = 2^{H_1N}$,
where $H_1 = H + (1/N)$ for sufficiently large N differs from H as little as desired,
it is easy to see that all preceding estimates also remain valid in this case. In
other words, here also it can be shown that the mean value of the probability P
of erroneous deciphering of the chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ received at the channel
output, over which some one of our $2 \times 2^{HN} = 2^{H_1N}$ code words was
transmitted, is for sufficiently large N necessarily less than $\varepsilon/2$. Thus, if
$P_1, P_2, \ldots, P_{2 \times 2^{HN}}$ are the probabilities of erroneous deciphering of the 1st, 2nd, ...,
$(2 \times 2^{HN})$th code words transmitted through the communication channel,
then for sufficiently large N the mean values of all these variables are less than
$\varepsilon/2$.

We now consider a new random variable, namely

$$P_0 = \frac{P_1 + P_2 + \ldots + P_{2 \times 2^{HN}}}{2 \times 2^{HN}},$$

equal to the arithmetic mean of all $P_i$. It is clear that if the mean values of all
$P_i$ are less than $\varepsilon/2$, then the mean value of $P_0$ is also less than $\varepsilon/2$. We now
apply to the variable $P_0$ the assertion that the mean value of a random variable
cannot be less than all its values. Then we obtain that, for at least one of the
possible random choices of $2 \times 2^{HN}$ code words, the value of $P_0$ must be less
than $\varepsilon/2$. However, all the variables $P_1, P_2, \ldots, P_{2 \times 2^{HN}}$, which are
probabilities, cannot be negative; hence it is clear that if $2^{HN}$ or more of them are
found to be not less than $\varepsilon$, then their arithmetic mean $P_0$ would not be less
than $\varepsilon/2$. This implies that no less than $2^{HN}$ of the values of $P_i$, $i = 1, 2, \ldots,
2 \times 2^{HN}$, must be less than $\varepsilon$. The chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ corresponding to $2^{HN}$
suitable i (such that $P_i < \varepsilon$) we take as the $2^{HN}$ code words needed by us. In other
words, we shall transmit them alone through our communication channel and
decipher the received chains $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$ as if no other code words existed.
We now note that in all those cases in which for the received chains conditions
(A) and (B) are found to hold in relation to $2 \times 2^{HN}$ code words, they also
remain all the more valid when half of the code words used previously are
discarded. Hence all the inequalities derived above for the error probabilities $P_i$
cannot worsen because of the fact that we rejected half of the initially chosen
$2 \times 2^{HN}$ code words. This gives the desired proof and establishes specifically that
for sufficiently large N there always exists a choice of $2^{HN}$ code words
$A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ and of the method of decoding the received chain $B_{j_1}B_{j_2}\ldots B_{j_{N_1}}$, such that
the probability of decoding error is less than $\varepsilon$, irrespective of what code words
were transmitted through the communication channel.
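
The discarding step can be sketched in a few lines. The list of error probabilities below is invented for illustration and the function name `expurgate` is ours; the sketch only shows the averaging argument, namely that if the arithmetic mean of an even number of nonnegative values is below ε/2, then the smaller half of them are all below ε.

```python
def expurgate(error_probs, eps):
    """Keep the half of the code words with the smallest error probabilities.

    Requires an even number of values whose arithmetic mean is below eps / 2.
    Returns the indices of the kept code words; each has error below eps.
    """
    total = len(error_probs)
    if total % 2 != 0:
        raise ValueError("expected an even number of code words")
    if sum(error_probs) / total >= eps / 2:
        raise ValueError("the arithmetic mean must be below eps / 2")
    # Stable sort of the indices by error probability, smallest first.
    order = sorted(range(total), key=lambda i: error_probs[i])
    kept = order[: total // 2]
    # If a kept value were >= eps, more than half of all values would be
    # >= eps and the mean would be at least eps / 2, a contradiction.
    assert all(error_probs[i] < eps for i in kept)
    return kept


# Invented example: six code words, eps = 0.2, mean error 0.065 < 0.1.
probs = [0.0, 0.01, 0.3, 0.02, 0.05, 0.01]
kept = expurgate(probs, eps=0.2)
print(kept)
```

Note that one code word in the invented list has error probability 0.3, well above ε; it is exactly such words that the discarding step removes.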


The definition of the noisy channel capacity on p. 263 was based on the following
assumption: if c is the largest amount of information that can be obtained at the channel output when
one elementary signal is transmitted through the communication channel, then on receiving L
such signals we cannot obtain more than Lc units of information. This assumption seems to
be quite natural, but its mathematical proof is nevertheless not trivial. We shall now briefly
explain how such a proof can be deduced.

Suppose that $\beta$ (resp. $\alpha$) is an experiment consisting of determining the value of one
elementary signal transmitted through the channel (resp. received at the channel output). Then
by assumption $I(\alpha, \beta) \le c$. It is required to show that if $\beta_1\beta_2\ldots\beta_L$ is a compound experiment
consisting of the successive realizations of experiments $\beta_1, \beta_2, \ldots, \beta_L$ (i.e., consisting of the
successive transmission of L elementary signals), and $\alpha_1\alpha_2\ldots\alpha_L$ is another compound
experiment consisting of receiving these L transmitted signals, then necessarily

$$I(\alpha_1\alpha_2\ldots\alpha_L, \beta_1\beta_2\ldots\beta_L) \le Lc.$$

It is clear that we only need to demonstrate that

$$I(\alpha_1\alpha_2\ldots\alpha_L, \beta_1\beta_2\ldots\beta_L) \le I(\alpha_1, \beta_1) + I(\alpha_2, \beta_2) + \ldots + I(\alpha_L, \beta_L).$$

In fact, each term on the right-hand side of this inequality equals the information about one
transmitted signal contained in the corresponding received signal, i.e., it cannot exceed c.

We can restrict ourselves to the simplest case L = 2. This is possible since in the
inequality obtained $\alpha_2$ and $\beta_2$ can always be replaced by the compound experiments $\alpha_2\alpha_3\ldots\alpha_L$ and
$\beta_2\beta_3\ldots\beta_L$ and then induction performed over the number L. As regards the proof of our
inequality for L = 2, it can be obtained quickly by applying the triple information equation
(see p. 92), which states that

$$I(\beta\gamma, \alpha) + I(\beta, \gamma) = I(\alpha\gamma, \beta) + I(\alpha, \gamma).$$

Putting $\beta = \alpha_1$, $\gamma = \alpha_2$ and $\alpha = \beta_1\beta_2$ in this equation we get

$$I(\alpha_1\alpha_2, \beta_1\beta_2) + I(\alpha_1, \alpha_2) = I(\beta_1\beta_2\alpha_2, \alpha_1) + I(\beta_1\beta_2, \alpha_2).$$

We now make use of the fact that the information contained in a compound experiment $\beta\gamma$
relative to a certain experiment $\alpha$ is equal to $I(\beta, \alpha)$ if the conditional probability of the
outcome of $\alpha$ for a given outcome of $\beta\gamma$ in fact depends only upon the outcome of $\beta$ (see p. 89).
In our case the conditional probability of the outcome of $\alpha_1$ given the outcome of $\beta_1\beta_2\alpha_2$ can
obviously depend only on the outcome of $\beta_1$; exactly in the same way, the conditional
probability of the outcome of $\alpha_2$ given the outcome of $\beta_1\beta_2$ depends only on the outcome of $\beta_2$.
Hence

$$I(\beta_1\beta_2\alpha_2, \alpha_1) = I(\beta_1, \alpha_1), \qquad I(\beta_1\beta_2, \alpha_2) = I(\beta_2, \alpha_2), \qquad \text{(C)}$$

and since $I(\alpha_1, \alpha_2) \ge 0$ (the information is always nonnegative), we have

$$I(\alpha_1\alpha_2, \beta_1\beta_2) \le I(\beta_1, \alpha_1) + I(\beta_2, \alpha_2),$$

giving the desired proof.†
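
For L = 2 the inequality can also be checked numerically. The sketch below is an illustration under assumptions of our own choosing: the channel is a binary symmetric channel with error probability 0.1, acting independently on each of the two signals, and the two transmitted signals are either correlated or independent. All function names are hypothetical.

```python
import math
from itertools import product


def mutual_information(joint):
    """I(X, Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), pr in joint.items():
        px[x] = px.get(x, 0.0) + pr
        py[y] = py.get(y, 0.0) + pr
    return sum(pr * math.log2(pr / (px[x] * py[y]))
               for (x, y), pr in joint.items() if pr > 0.0)


def two_signal_informations(input_dist, p):
    """Return I(a1 a2, b1 b2), I(a1, b1), I(a2, b2) for two uses of the channel.

    input_dist maps pairs of transmitted signals (x1, x2) to probabilities;
    each signal is received wrongly with probability p, independently.
    """
    def channel(sent, received):
        return 1.0 - p if sent == received else p

    both, first, second = {}, {}, {}
    for (x1, x2), px in input_dist.items():
        for y1, y2 in product((0, 1), repeat=2):
            pr = px * channel(x1, y1) * channel(x2, y2)
            for table, key in ((both, ((x1, x2), (y1, y2))),
                               (first, (x1, y1)),
                               (second, (x2, y2))):
                table[key] = table.get(key, 0.0) + pr
    return (mutual_information(both),
            mutual_information(first),
            mutual_information(second))


correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {xs: 0.25 for xs in product((0, 1), repeat=2)}

joint_c, first_c, second_c = two_signal_informations(correlated, 0.1)
joint_i, first_i, second_i = two_signal_informations(independent, 0.1)
print(joint_c, first_c + second_c)
print(joint_i, first_i + second_i)
```

For the correlated input the joint information is strictly smaller than the sum of the two single-signal informations, because $I(\alpha_1, \alpha_2) > 0$; for independent inputs the two sides coincide.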
†In deriving the equations (C) we have in fact made use of the following result: the
conditional probability of the outcome $B_kB_l$ of experiment $\alpha_1\alpha_2$ given that the experiment $\beta_1\beta_2$ has
the outcome $A_iA_j$ (i.e., the probability of receiving a pair of signals $B_kB_l$ if $A_iA_j$ are transmitted)
can be represented in the form

$$p_{A_iA_j}(B_kB_l) = p_{A_i}(B_k) \times p_{A_j}(B_l),$$

where $p_{A_i}(B_k)$ and $p_{A_j}(B_l)$ are the characteristics of the noisy channel, which are known to
us. Indeed, this expressly implies that the outcome of $\alpha_1$ (resp. $\alpha_2$) depends only on the
outcome of $\beta_1$ (resp. $\beta_2$). If we now substitute these probabilities $p_{A_iA_j}(B_kB_l)$ in the expression
for the conditional entropy $H_{\beta_1\beta_2}(\alpha_1\alpha_2)$, then by elementary transformations it can be shown
directly that

$$H_{\beta_1\beta_2}(\alpha_1\alpha_2) = H_{\beta_1}(\alpha_1) + H_{\beta_2}(\alpha_2),$$

and, consequently,

$$I(\alpha_1\alpha_2, \beta_1\beta_2) = H(\alpha_1\alpha_2) - H_{\beta_1\beta_2}(\alpha_1\alpha_2) \le I(\alpha_1, \beta_1) + I(\alpha_2, \beta_2)$$

(since $H(\alpha_1\alpha_2) \le H(\alpha_1) + H(\alpha_2)$; see p. 64). However, such a proof is found to be lengthier
than the one ingeniously derived above.

We shall now show one more method of proving Shannon’s fundamental noisy coding
theorem for the simplest binary symmetric channel.† Through such a channel we can transmit
two elementary signals $A_1$ and $A_2$, each of them having the probability 1 - p (resp. p) of
receiving the same (resp. the opposite) signal at the output. As noted on p. 265, without restricting
generality it can be assumed that p < 1/2. Sequences $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$ of $N_1$ signals are used as
code words. Here all $i_k$ (where $k = 1, 2, \ldots, N_1$) can take the value 1 or 2, and hence there
exist altogether $2^{N_1}$ such different sequences. Suppose that $\varepsilon$ is some preassigned small number;
the requirement is that the probability of error in deciphering any transmitted code word does
not exceed $\varepsilon$. We are interested in how many code words can be chosen without coming into
conflict with the italicized condition. It is shown below that for sufficiently large $N_1$ the
possible number K of such code words can be made arbitrarily close to $2^{cN_1}$, where

$$c = 1 + (1 - p) \log (1 - p) + p \log p$$

is the capacity of the channel under consideration related to one transmitted signal. Since a
message that one fixed word is chosen by us from K possible words can supply log K bits of
information, this implies that over the channel we can transmit information at a rate as close
as desired to C = Lc bits per unit time in such a way that the probability of error in
deciphering each transmitted signal does not exceed $\varepsilon$. Therefore, the proof of the formulated
statement is equivalent to the proof of Shannon’s theorem.
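
The capacity formula is easy to evaluate. The following minimal sketch (the function name is ours, and logarithms are taken to base 2 so that the result is in bits) computes c for a given error probability p, with the usual convention that 0 log 0 = 0.

```python
import math


def bsc_capacity(p: float) -> float:
    """Capacity c = 1 + (1 - p) log2(1 - p) + p log2 p, in bits per signal."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")

    def plogp(x: float) -> float:
        # Convention: 0 * log 0 = 0.
        return x * math.log2(x) if x > 0.0 else 0.0

    return 1.0 + plogp(p) + plogp(1.0 - p)


for p in (0.0, 0.01, 0.11, 0.5):
    print(p, round(bsc_capacity(p), 4))
```

For p = 0 the channel is noiseless and c = 1; for p = 1/2 the output is independent of the input and c = 0; for p near 0.11 roughly half a bit per signal survives.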

For the proof, the foremost requirement is to indicate a method of decoding the obtained
collection of signals which ensures that the probability of error in deciphering each code word
will not exceed $\varepsilon$. For this purpose, it is appropriate to make use of the Chebyshev inequality
proved in Chap. 1.4. Using formula (****) on p. 35, it is easy to show that if

$$N_2 = \sqrt{\frac{2N_1p(1 - p)}{\varepsilon}},$$

then the probability $p_0$ that the number x of errors in decoding $N_1$ successively transmitted
elementary signals $A_i$ does not exceed $M = N_1p + N_2$ satisfies the inequality

$$p_0 = P(x \le N_1p + N_2) > 1 - \frac{\varepsilon}{2}. \qquad (*)$$

We further note that for fixed p and $\varepsilon$ the ratio

$$\frac{N_2}{N_1} = \sqrt{\frac{2p(1 - p)}{\varepsilon}} \times \frac{1}{\sqrt{N_1}}$$

can be made as small as desired if only $N_1$ is chosen sufficiently large. Hence

$$M = N_1p + N_2 = N_1\left(p + \frac{N_2}{N_1}\right)$$

can be made as close as desired to $N_1p$ in relative terms. In particular, when p < 1/2 and $N_1$ is sufficiently large,
$M = N_1p + N_2$ is less than $N_1/2$; hereafter $N_1$ will be taken to be so large that the preceding
condition is satisfied.
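
Inequality (*) can be compared with the exact binomial probability. In the sketch below the values $N_1$ = 400, p = 0.1 and ε = 0.1 are chosen only for illustration, and the function names are ours; the number of errors is taken to follow the binomial law with parameters $N_1$ and p, as it does for a binary symmetric channel.

```python
import math


def chebyshev_margin(n1: int, p: float, eps: float) -> float:
    """N_2 = sqrt(2 * N_1 * p * (1 - p) / eps)."""
    return math.sqrt(2.0 * n1 * p * (1.0 - p) / eps)


def prob_errors_at_most(n1: int, p: float, m: float) -> float:
    """Exact probability that the number of errors among n1 signals is <= m."""
    top = min(n1, math.floor(m))
    return math.fsum(math.comb(n1, k) * p ** k * (1.0 - p) ** (n1 - k)
                     for k in range(top + 1))


# Illustrative values only.
n1, p, eps = 400, 0.1, 0.1
n2 = chebyshev_margin(n1, p, eps)
m = n1 * p + n2
p0 = prob_errors_at_most(n1, p, m)
print(f"N2 = {n2:.2f}, M = {m:.2f}, p0 = {p0:.6f}")
```

For these values M is about 66.8, which is indeed less than $N_1/2$ = 200, and the exact probability is far closer to unity than the guaranteed 1 - ε/2 = 0.95; the Chebyshev inequality is crude, but it is all the proof requires.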

†As already remarked above the idea of this proof is due to Feinstein who, however, studied
directly the general case of an arbitrary communication channel. The application of Feinstein’s
arguments to the simplest particular case of a binary symmetric channel was examined by
Gilbert [184] and Slepian [187]. One more variant of the simplified proof of Shannon’s theorem
for this case can be found in Barnard [180].

We now choose the first code word (which for brevity we denote by $\mathcal{A}_1$) in an arbitrary
manner from among the $2^{N_1}$ different chains $A_{i_1}A_{i_2}\ldots A_{i_{N_1}}$. We shall consider $\mathcal{A}_1$ to have been
transmitted if at the channel output a message is received differing from the $N_1$-term chain $\mathcal{A}_1$ in
not more than M elementary signals. We denote by the symbol $R(\mathcal{A}_1)$ the collection of all
possible $N_1$-term chains differing from the chain $\mathcal{A}_1$ in not more than M signals. Thus, the received
$N_1$-term chain is deciphered as the chain $\mathcal{A}_1$ if it belongs to the collection $R(\mathcal{A}_1)$; the
probability of error in decoding $\mathcal{A}_1$ then does not exceed $\varepsilon/2$ because of (*).

We now take up the choice of the second code word $\mathcal{A}_2$. We first agree to regard $\mathcal{A}_2$ to
have been transmitted if at the channel output there is received an $N_1$-term chain that


(a) differs from $\mathcal{A}_2$ in not more than M elementary signals; and
(b) does not belong to the collection $R(\mathcal{A}_1)$.


We are interested in only such code words $\mathcal{A}_2$, the probability of whose erroneous decoding at
the channel output does not exceed $\varepsilon$. It is clear that this is certainly the situation if in the
transmission of the chain $\mathcal{A}_2$ the probability of receiving some of the chains of the collection $R(\mathcal{A}_1)$
is less than $\varepsilon/2$. For those cases in which an $N_1$-term chain satisfying this condition does not
exist at all, we consider that K = 1; if, however, there exist $N_1$-term chains satisfying it, we
accept any of them as $\mathcal{A}_2$.

We act similarly also in the choice of the third code word $\mathcal{A}_3$. Namely, if there does not
exist an $N_1$-term chain of signals to be transmitted such that when it is transmitted the
probability of receiving at the channel output in place of it one of the chains belonging to either the
collection $R(\mathcal{A}_1)$ or $R(\mathcal{A}_2)$ is less than $\varepsilon/2$, then we consider that K = 2; otherwise, we take any of the
chains satisfying the italicized condition as the third code word $\mathcal{A}_3$. In analogy to this, after
the first k code words $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_k$ are chosen, as the (k + 1)th code word we choose any
$N_1$-term chain $\mathcal{A}_{k+1}$ such that in the case of its transmission through the communication channel
the probability of receiving at the channel output one of the chains belonging to $R(\mathcal{A}_1)$, or $R(\mathcal{A}_2)$,
..., or $R(\mathcal{A}_k)$ is less than $\varepsilon/2$. The choice of all code words is regarded to be complete when
it is found that no new chain satisfying the condition set out here exists. When decoding the
messages received at the channel output, we regard the ith word $\mathcal{A}_i$ to have been transmitted
if the received chain is one that


(a’) differs from $\mathcal{A}_i$ in not more than M signals; and
(b’) belongs to none of the collections $R(\mathcal{A}_1), R(\mathcal{A}_2), \ldots, R(\mathcal{A}_{i-1})$.


If, however, the received chain differs from all existing code words $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_K$ in more
than M signals, then we decode it arbitrarily (say, we agree in all such cases to consider the
code word $\mathcal{A}_1$ to have been transmitted). It is clear that the method employed for decoding
the received $N_1$-term chains of signals guarantees that when any of the words $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_K$
is transmitted it is decoded correctly at the channel output with probability exceeding $1 - \varepsilon$.
Thus, what remains to verify is just that the number K of such words for sufficiently large $N_1$
is sufficiently large (and, specifically, that K can be made as close as desired to $2^{cN_1}$).
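
The selection rule can be tried out by brute force for very short chains. The sketch below is only an illustration: the values $N_1$ = 5, M = 1, p = 0.05 and ε = 0.2 are hypothetical, M is passed as a parameter instead of being computed from (*) (for so short a chain the condition M < $N_1$/2 would otherwise fail), signals are written as 0 and 1, and all names are ours.

```python
from itertools import product


def hamming(u, v):
    """Number of places in which two chains differ."""
    return sum(a != b for a, b in zip(u, v))


def landing_probability(sent, targets, p):
    """Probability that `sent` is received as one of the chains in `targets`."""
    n = len(sent)
    total = 0.0
    for y in targets:
        d = hamming(sent, y)
        total += p ** d * (1.0 - p) ** (n - d)
    return total


def greedy_code(n1, m, p, eps):
    """Choose code words one at a time by the rule described in the text.

    A chain is accepted if, when it is transmitted, the probability of
    receiving a chain from the collections R of the earlier code words is
    below eps / 2. One pass suffices: the union of the collections only
    grows, so a chain rejected once would be rejected again later.
    """
    chains = list(product((0, 1), repeat=n1))
    code_words, covered = [], set()
    for candidate in chains:
        if landing_probability(candidate, covered, p) < eps / 2:
            code_words.append(candidate)
            covered.update(y for y in chains if hamming(candidate, y) <= m)
    return code_words, covered


code_words, covered = greedy_code(n1=5, m=1, p=0.05, eps=0.2)
print(len(code_words), code_words)
```

The set `covered` is the union of the collections R of all chosen code words; its size is the quantity estimated from below in the remainder of the proof.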

In order to evaluate K, we first estimate the number $L_1$ of chains occurring in the collection
$R(\mathcal{A})$ (where $\mathcal{A}$ is an arbitrary $N_1$-term chain). It is clear that $R(\mathcal{A})$ includes


(0) one chain $\mathcal{A}$;

(1) $\binom{N_1}{1} = N_1$ distinct chains differing from $\mathcal{A}$ in one signal;

(2) $\binom{N_1}{2}$ distinct chains differing from $\mathcal{A}$ in two signals;

. . . . . . . . .

(M) $\binom{N_1}{M}$ distinct chains differing from $\mathcal{A}$ in $M = N_1p + N_2$ signals.

Hence

$$L_1 = 1 + \binom{N_1}{1} + \binom{N_1}{2} + \ldots + \binom{N_1}{M}.$$

The number of terms on the right-hand side of this equality we estimate as the number
$M = N_1p + N_2 < N_1/2$ (because the term 1 at the start cannot affect an estimate for very large
$L_1$). Moreover, it is known that in the sequence of binomial coefficients

$$\binom{N_1}{1}, \binom{N_1}{2}, \ldots, \binom{N_1}{N_1}$$

the terms monotonically increase up to the middle of this sequence. Hence, since $M < N_1/2$,
the largest coefficient in the sequence $\binom{N_1}{1}, \ldots, \binom{N_1}{M}$ is the last coefficient. Hence

$$L_1 < M \times \binom{N_1}{M} < \frac{N_1}{2} \times \binom{N_1}{M}.$$
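Furthermore, using inequality (**) on p. 165 and noting that

$$N_1 - M = N_1(1 - p) - N_2 = N_1q - N_2,$$

where q = 1 - p, we get

$$L_1 < \frac{N_1}{2} \times \frac{N_1^{N_1}}{(N_1p + N_2)^{N_1p + N_2}(N_1q - N_2)^{N_1q - N_2}} = \frac{N_1}{2} \times \frac{1}{\left(p + \frac{N_2}{N_1}\right)^{N_1p + N_2}\left(q - \frac{N_2}{N_1}\right)^{N_1q - N_2}}. \qquad (**)$$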
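
Both estimates of $L_1$ can be checked on a small example. The sketch below makes two assumptions that should be stated plainly: first, since the estimate in the text drops the leading term 1, the sketch counts all M + 1 terms, which gives a bound valid for every $N_1$; second, the inequality cited from p. 165 is taken to be the standard bound of a binomial coefficient $\binom{n}{k}$ by $n^n / (k^k (n-k)^{n-k})$. The values $N_1$ = 100 and M = 20 are illustrative and the function names are ours.

```python
import math


def ball_size(n1: int, m: int) -> int:
    """Exact number L_1 of n1-term chains differing from a given chain in <= m places."""
    return sum(math.comb(n1, k) for k in range(m + 1))


def log2_binomial_bound(n1: int, m: int) -> float:
    """log2 of n1**n1 / (m**m * (n1 - m)**(n1 - m)), for 0 < m < n1."""
    return (n1 * math.log2(n1)
            - m * math.log2(m)
            - (n1 - m) * math.log2(n1 - m))


# Illustrative values with M < N_1 / 2.
n1, m = 100, 20
l1 = ball_size(n1, m)
largest = math.comb(n1, m)
log2_largest = math.log2(largest)
print(l1, (m + 1) * largest)
print(round(log2_largest, 2), round(log2_binomial_bound(n1, m), 2))
```

On the logarithmic scale the factors M + 1 or $N_1/2$ contribute only a term of order $\log N_1$, which is negligible next to the exponent proportional to $N_1$; this is why the rough count of terms does no harm to the final result.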


It is further required to estimate the number $L_2$ of all possible $N_1$-term sequences of received
signals occurring in at least one of the collections $R(\mathcal{A}_1), R(\mathcal{A}_2), \ldots, R(\mathcal{A}_K)$. We set forth our
reasoning as follows. Let us consider the process of the transmission of all $2^{N_1}$ possible
$N_1$-term sequences $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_{2^{N_1}}$, each of these sequences having the same probability $1/2^{N_1}$
of being transmitted.† In such a case the probability of transmission of a sequence belonging
to at least one of the collections $R(\mathcal{A}_1), R(\mathcal{A}_2), \ldots, R(\mathcal{A}_K)$ is obviously equal to $L_2/2^{N_1}$ (see
the italicized definition of probability on p. 4). When all the sequences $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_{2^{N_1}}$
of $N_1$ signals $A_1$ and $A_2$ are transmitted there are also received at the output of our binary
channel $N_1$-term sequences of these same signals; we denote by $p(\mathcal{A}_i\mathcal{A}_j)$ the probability that
the sequence $\mathcal{A}_i$ is transmitted and the sequence $\mathcal{A}_j$ is received. We now agree to number the
$N_1$-term chains in such a way that the chains occurring in at least one of the collections $R(\mathcal{A}_1)$,
$R(\mathcal{A}_2), \ldots, R(\mathcal{A}_K)$ correspond to the first $L_2$ numbers (i.e., we consider the chains $\mathcal{A}_1, \mathcal{A}_2,
\ldots, \mathcal{A}_{L_2}$ occurring in at least one of these collections, where obviously $L_2$ is much larger
than K). In such a case the event consisting of one of the first $L_2$ chains $\mathcal{A}_i$ being transmitted
can be represented in the form of the sum of the following $L_2 \times 2^{N_1}$ incompatible events: the
chain $\mathcal{A}_i$ is transmitted, where i runs through the values $1, 2, \ldots, L_2$, and the chain $\mathcal{A}_j$ is
received, where j runs through all values $1, 2, \ldots, 2^{N_1}$ (i.e., $\mathcal{A}_j$ runs through all possible
$N_1$-term chains). Thus

$$\begin{aligned}
\frac{L_2}{2^{N_1}} = {} & p(\mathcal{A}_1\mathcal{A}_1) + p(\mathcal{A}_1\mathcal{A}_2) + \ldots + p(\mathcal{A}_1\mathcal{A}_{2^{N_1}}) \\
& + p(\mathcal{A}_2\mathcal{A}_1) + p(\mathcal{A}_2\mathcal{A}_2) + \ldots + p(\mathcal{A}_2\mathcal{A}_{2^{N_1}}) \\
& + \ldots \\
& + p(\mathcal{A}_{L_2}\mathcal{A}_1) + p(\mathcal{A}_{L_2}\mathcal{A}_2) + \ldots + p(\mathcal{A}_{L_2}\mathcal{A}_{2^{N_1}}).
\end{aligned}$$

†The examination of such a transmission process occupies in the present proof a place
allied to the role of the random coding procedure in Shannon’s proof (see p. 292). Recall that
for a binary symmetric channel the channel capacity is attained for the probabilities

$$p^0(A_1) = p^0(A_2) = \frac{1}{2}.$$

Hence the successive transmission of signals $A_i$, when any $A_i$ is independent of all preceding
signals and takes its possible values with probabilities $p^0$, corresponds precisely to the
transmission of all $N_1$-term chains having the same probability $1/2^{N_1}$.


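We now note that the probability $p(\mathcal{A}_i\mathcal{A}_j)$ is determined only by the number of the signals of $\mathcal{A}_i$
that do not coincide with the corresponding signals of $\mathcal{A}_j$ (i.e., the number of errors in the
transmission that transform the chain $\mathcal{A}_i$ into $\mathcal{A}_j$). Hence, it is clear that $p(\mathcal{A}_i\mathcal{A}_j) = p(\mathcal{A}_j\mathcal{A}_i)$
and, consequently,

$$\begin{aligned}
\frac{L_2}{2^{N_1}} = {} & p(\mathcal{A}_1\mathcal{A}_1) + p(\mathcal{A}_2\mathcal{A}_1) + \ldots + p(\mathcal{A}_{2^{N_1}}\mathcal{A}_1) \\
& + p(\mathcal{A}_1\mathcal{A}_2) + p(\mathcal{A}_2\mathcal{A}_2) + \ldots + p(\mathcal{A}_{2^{N_1}}\mathcal{A}_2) \\
& + \ldots \\
& + p(\mathcal{A}_1\mathcal{A}_{L_2}) + p(\mathcal{A}_2\mathcal{A}_{L_2}) + \ldots + p(\mathcal{A}_{2^{N_1}}\mathcal{A}_{L_2}).
\end{aligned}$$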
It is obvious also that the sum of the terms appearing in the jth column on the right-hand side
of the preceding equation (i.e., appearing below each other at the jth place in each row) can be
represented in the form

$$p(\mathcal{A}_j\mathcal{A}_1) + p(\mathcal{A}_j\mathcal{A}_2) + \ldots + p(\mathcal{A}_j\mathcal{A}_{L_2}) = p(\mathcal{A}_j)\,[p(\mathcal{A}_1/\mathcal{A}_j) + p(\mathcal{A}_2/\mathcal{A}_j) + \ldots + p(\mathcal{A}_{L_2}/\mathcal{A}_j)] = \frac{1}{2^{N_1}}\, p(\mathcal{A}_1 + \mathcal{A}_2 + \ldots + \mathcal{A}_{L_2}/\mathcal{A}_j),$$

where $p(\mathcal{A}_j) = 1/2^{N_1}$ is the probability of the transmission of the chain $\mathcal{A}_j$, $p(\mathcal{A}_i/\mathcal{A}_j)$ is the
conditional probability of receiving the chain $\mathcal{A}_i$ given that the chain $\mathcal{A}_j$ is transmitted, and
$p(\mathcal{A}_1 + \mathcal{A}_2 + \ldots + \mathcal{A}_{L_2}/\mathcal{A}_j)$ is the conditional probability of receiving one of the first $L_2$
chains subject to the same condition. But it is easy to see that in the transmission of
any $N_1$-term chain $\mathcal{A}_j$ the probability of receiving one of the first $L_2$ chains cannot be less than
$\varepsilon/2$. In fact, if the transmitted chain $\mathcal{A}_j$ is one of the K code words $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_K$ chosen by
us, then the probability of receiving a chain of the collection $R(\mathcal{A}_j)$ is larger than $1 - (\varepsilon/2)$ and
hence larger than the small number $\varepsilon/2$. Moreover, if for some other $N_1$-term transmitted
chain the probability of receiving a chain belonging to at least one of the collections $R(\mathcal{A}_1)$,
$R(\mathcal{A}_2), \ldots, R(\mathcal{A}_K)$ turns out to be smaller than $\varepsilon/2$, then in such a case this chain can be
chosen as one more code word, contradicting the assumption that it is not possible to choose
more than K code words.

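Thus, on the right-hand side of the preceding multiline equality there occur $2^{N_1}$ columns,
the sum of the terms of each of which is not less than $(1/2^{N_1}) \times (\varepsilon/2)$. Hence, finally,

$$\frac{L_2}{2^{N_1}} \ge 2^{N_1} \times \left(\frac{1}{2^{N_1}} \times \frac{\varepsilon}{2}\right) = \frac{\varepsilon}{2}, \quad \text{i.e.,} \quad L_2 \ge \varepsilon\, 2^{N_1 - 1}. \qquad (***)$$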


It is now quite straightforward to obtain the result we desire to prove. In fact, $L_2$ is the
number of chains that belong to K different (in general, not disjoint) collections $R(\mathcal{A}_1), R(\mathcal{A}_2),
\ldots, R(\mathcal{A}_K)$, each of which contains $L_1$ different chains. Consequently,

$$K \ge \frac{L_2}{L_1}.$$


Using the estimates (**) and (***) of $L_1$ and $L_2$, it is found that

$$K > \frac{\varepsilon\, 2^{N_1}}{N_1} \left(p + \frac{N_2}{N_1}\right)^{N_1\left(p + (N_2/N_1)\right)} \left(q - \frac{N_2}{N_1}\right)^{N_1\left(q - (N_2/N_1)\right)}.$$

We know that for sufficiently large $N_1$ the ratio $N_2/N_1$ becomes arbitrarily small. Since

$$\frac{\log K}{N_1} > 1 + \left(p + \frac{N_2}{N_1}\right) \log \left(p + \frac{N_2}{N_1}\right) + \left(q - \frac{N_2}{N_1}\right) \log \left(q - \frac{N_2}{N_1}\right) - \frac{\log N_1}{N_1} + \frac{\log \varepsilon}{N_1},$$

it follows that $\log K/N_1$ for sufficiently large $N_1$ is larger than a number arbitrarily close to
$c = 1 + p \log p + q \log q$. But we also know that the number K cannot be larger than $2^{cN_1}$
(see pp. 273 and 283); hence it is seen that for sufficiently large $N_1$ the number $\log K/N_1$ can
be made as close as desired to c. As already remarked in the foregoing, it directly implies the
validity of Shannon’s theorem for a binary symmetric channel.


In conclusion we also present a rigorous proof of Fano’s inequality (A’) given on p. 287: in
fact, the reasoning adduced on pp. 287-288 partially relies on intuitive notions about information
and hence, strictly speaking, cannot be considered a proof. Such a proof is easy to
obtain if we attach exact meanings to all the arguments used earlier. We had based our arguments
on the fact that the amount of uncertainty of an experiment β with n outcomes A_1, A_2,
..., A_n having the probabilities π_1, π_2, ..., π_n is equal to the amount of uncertainty of an
experiment γ consisting of verifying whether β had or did not have the outcome A_n, plus the
product of π_1 + π_2 + ... + π_{n-1} = 1 - π_n and the amount of uncertainty of the experiment γ_1 with
n - 1 outcomes, which represents the same experiment β but with the auxiliary restriction that
the outcome A_n had not taken place. However, if as usual we denote by


    H(π_1, π_2, ..., π_n)


the quantity


    -π_1 log π_1 - π_2 log π_2 - ... - π_n log π_n

equal to the amount of uncertainty (entropy) of an experiment with n outcomes having probabilities
π_1, π_2, ..., π_n, then the stated assertion is formally equivalent to the relation


    H(π_1, π_2, ..., π_n) = H(π_n, 1 - π_n)
        + (1 - π_n) H(π_1/(1 - π_n), π_2/(1 - π_n), ..., π_{n-1}/(1 - π_n)).

The validity of the preceding relation can be easily verified by direct calculation. We further
note that we used on p. 96 an even more general relation for H(π_1, π_2, ..., π_n), whose meaning
was then explained exactly in the same way as now.

We now assume that we know the outcome α_1, or α_2, ..., or α_n of an experiment α consisting
of decoding one text letter at the channel output. Then the preceding relation can be
applied to the amount of uncertainty H_{α_1}(β), or H_{α_2}(β), ..., or H_{α_n}(β) of an experiment β
(consisting of determining one letter of the transmitted text) when the outcome of α is known.
We shall assume that the outcome A_n with probability π_n is in all cases that outcome of β
which coincides with the known outcome of α. Since


    H(π_1/(1 - π_n), π_2/(1 - π_n), ..., π_{n-1}/(1 - π_n))


is the entropy of an experiment with n - 1 outcomes, which for any values of π_1, π_2, ...,
π_{n-1}, π_n is not larger than log (n - 1), we obtain

    H_{α_1}(β) ≤ h(q_1) + q_1 log (n - 1),
    H_{α_2}(β) ≤ h(q_2) + q_2 log (n - 1),
    . . .
    H_{α_n}(β) ≤ h(q_n) + q_n log (n - 1),

where h(q) = H(q, 1 - q) = -q log q - (1 - q) log (1 - q) and q_1, q_2, ..., q_n have the same
meaning as on p. 284. Now multiply these inequalities by p_1, p_2, ..., p_n respectively and
add separately the left-hand and right-hand sides. Since h(q) is a convex function of q for
0 ≤ q ≤ 1, by Theorem 4 of Appendix I (p. 356) we have


    p_1 h(q_1) + p_2 h(q_2) + ... + p_n h(q_n) ≤ h(p_1 q_1 + p_2 q_2 + ... + p_n q_n) = h(q).

Hence the result obtained after addition can be rewritten in the form

    H_α(β) ≤ h(q) + q log (n - 1),


which is exactly Fano’s inequality that we desired to prove. 
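
The decomposition of the entropy used in this proof, and the bound that follows from it, can be checked numerically. The Python sketch below is only an illustration: the distribution `pi` is an invented example (its last entry plays the part of the outcome that coincides with the decoded letter), the function names are not taken from the text, and logarithms are to base 2.

```python
import math

def entropy(probs):
    """Entropy in bits; outcomes of zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def grouped_entropy(probs):
    """Right-hand side of the relation: split off the last outcome."""
    last = probs[-1]
    rest = [p / (1 - last) for p in probs[:-1]]
    return entropy([last, 1 - last]) + (1 - last) * entropy(rest)

def h(q):
    """h(q) = H(q, 1 - q)."""
    return entropy([q, 1 - q])

# Hypothetical probabilities pi_1, ..., pi_4 (an assumption of this sketch).
pi = [0.05, 0.03, 0.02, 0.90]
n = len(pi)

lhs = entropy(pi)
rhs = grouped_entropy(pi)
q = 1 - pi[-1]                          # probability that the letter is wrong
fano_bound = h(q) + q * math.log2(n - 1)

print(lhs, rhs)        # the two sides of the relation agree
print(fano_bound)      # and neither exceeds h(q) + q log (n - 1)
```

The second printed number is the bound for a single known outcome of α; averaging such bounds with the weights p_1, ..., p_n is the step that uses the convexity of h(q).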


4.5. Error-Detecting and Error-Correcting Codes 


Shannon’s noisy coding theorem forms the main result of Section 4.4. According
to this theorem, for any given channel with a channel capacity C = Lc and a
given transmission rate


    v_1 = L(c_1/H) < L(c/H)


letters/unit time
there certainly exists a method for the choice of code words (i.e., specific ‘blocks’
formed of lengthy sequences of elementary signals) that allows information transmission
at a rate v_1 such that the probability of erroneous decoding of any letter
of the transmitted message is less than an arbitrary (but preassigned) number ε.
On pp. 276-277 it was also remarked that Shannon’s theorem can be formulated
differently as follows: if c_1 < c, then for sufficiently large N it is possible to choose
2^{c_1 N} code words of length N in such a way that the probability
of erroneous decoding of a sequence of N elementary signals received at the channel
output is less than an arbitrary (preassigned) number ε regardless of what code
word was actually transmitted.† The latter version of the fundamental theorem
is more apt in that it is related only to the channel but is in no way connected
to the nature and statistical properties of the original message. Therefore, for
the most part we shall use this version hereafter.

Shannon’s coding theorem is fascinatingly simple but it also suffers from a
serious shortcoming from the practical viewpoint. In fact, it is a typical ‘existence
theorem’ and does not contain any indication of how one should choose
code words of some acceptable length N in order to assure a sufficiently small
probability of error at a given quite high (i.e., quite close to v = L(c/H)) rate
of transmission. The problem of determining a practically convenient method
for the choice of code words for different noisy channels forms the content of
coding theory, which developed after the appearance of Shannon’s basic work
[21] into a vast independent discipline of great importance for applications. A
large variety of different approaches and methods are used here, often
borrowed from branches of modern mathematics that are seemingly highly abstract
and detached from practical inquiries.†† Several tens, if not hundreds, of
textbooks and monographs (of which [190], [193], [204], [209]-[212], and [215]
are only a few examples) as well as several thousands of papers are devoted to
the exposition of this science. Coding theory is also expounded in special


†In Section 4.4 the length of code words was usually denoted by N_1, since N was used there
for denoting the length of encoded ‘blocks’ of the original lettered message. However, in the
present section the original message is generally not considered; hence it will be convenient here
to consider N as the code-word length.

††This fact is reflected in the title of the interesting popular article [208] by the well-known
American mathematician N. Levinson, viz. ‘Coding theory: a Counter-example to G. H. Hardy’s
Conception of Applied Mathematics’. The fact is that the famous English mathematician
G. H. Hardy in his book, A Mathematician’s Apology, written in 1940 (and subsequently
reprinted many times), divided mathematics into ‘pure’ mathematics, which is a source of great
aesthetic delight due to its harmony, logical regularity and elegance but is useless in practical
life, and ‘applied’ mathematics that is needed for practice but is tedious and rather trite. It is
precisely some of the most typical (according to Hardy’s opinion) branches of ‘pure’ mathematics,
(say) number theory or the theory of Galois fields, that were later assigned a central
role in (indisputably applied) coding theory!

sections of many general textbooks on information theory, applied algebra and
combinatorics (see, for example, [2], [8], [11], [25], [191], [201] and [206]) and 
numerous review papers (for instance, [187], [195], [197], [208] and [219]). In 
the present text it is obviously impossible to cover even briefly just the funda- 
mentals of modern coding theory. However, some relatively simple conclusions 
related to this theory can nevertheless be examined. 

A few clarifications are useful as a starting point for understanding how
problems are posed in coding theory. It is often asserted that all existing
proofs of Shannon’s fundamental theorem are ineffective, i.e., even in principle
they cannot be used to determine a method that allows us to choose the code
words (and a method of appropriately decoding the received sequence of elementary
signals) that assure a low value of error probability for a given
sufficiently high transmission rate. Actually, however, such an assertion cannot
be regarded as completely valid.

Indeed, recall, for example, the method of proving Shannon’s theorem by
using ‘random coding’ described on pp. 292-297. In the course of this proof, it
was suggested to choose randomly 2^{c_1 N} code words of length N (out of certain
preassigned 2^{H(β)N} ‘probable’ sequences of length N) and then it was shown
that in such a case there exists a decoding method for which the mean value of
the probability of erroneous decoding is sufficiently small (smaller than ε/2). We
further took advantage of the fact that at least one of the values of a random
variable never exceeds its mean value; for the proof of
the theorem this was quite adequate. But it is also possible to make much
further headway in this direction: it is clear that if the mean value of a nonnegative
random variable is quite small, then not one but almost all its values
must be comparatively small. The latter circumstance finds its mathematical
expression in the Chebyshev inequality (**) proved on p. 33. According to this
inequality, for any nonnegative random variable α


    P(α > c) ≤ a/c,   where a = m.v. α.


Hence, if a = m.v. α is so small that Ma also still remains small, where M is
some comparatively large number, then the value of α does not exceed the small
quantity Ma with very great probability (not less than 1 - 1/M). Proceeding
from similar arguments, it can be shown that if we make use of random coding
(and the decoding method described on p. 293), then for sufficiently large N the
probability of a decoding error (and not only its value for some specific but unknown
choice of the 2^{c_1 N} code words) is with very high probability (i.e., ‘almost
surely’) extremely small. This gives us a seemingly very simple method for
the choice of code words, which practically always leads to a small probability
of error.† For this, it is only necessary to take N sufficiently large and then
choose randomly 2^{c_1 N} code words of length N (by means of the urn experiment
described on p. 292).

But how can this ‘simple’ method actually be used in practice? Obviously,
for obtaining good results here it is usually necessary to prescribe that N be at
least of the order of many tens or even hundreds. If we assume that N = 100
and c_1 = 0.5, then it is necessary for us to choose randomly 2^{50} ≈ 10^{15} distinct
sequences of 100 elementary signals and all of them must be memorized. However,
this itself is the smaller part of the task, for incomparably greater
difficulty is encountered in decoding the received sequences of elementary signals.
By what was stated on p. 278 et seq., for such decoding we must examine all 2^{50}
groups & corresponding to our code words to ascertain to which of them the 
received sequence of signals belongs and to which of them it does not, which 
poses a problem beyond the capabilities of all existing (and even those likely to 
appear in the near future) computers. 
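
The orders of magnitude quoted here are easy to reproduce. The short Python sketch below uses only the values N = 100 and c_1 = 0.5 from the text; the assumption that each code word is stored as exactly N bits, with no overhead, is ours and serves only to give a feeling for the size of the table.

```python
N = 100      # code-word length, as in the text
c1 = 0.5     # the number c_1, as in the text

num_words = 2 ** int(c1 * N)     # number of code words, 2^50
bits_needed = num_words * N      # assumed storage: N bits per code word
bytes_needed = bits_needed // 8

print(num_words)                 # 1125899906842624, i.e. about 10^15
print(bytes_needed / 1e15)       # storage in units of 10^15 bytes
```

Even before decoding is considered, the table of code words alone occupies on the order of 10^16 bytes under this assumption.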

It is thus seen that the basic difficulty in coding theory is not so much that in
general it is impossible to indicate a coding method (i.e., a method of choice
of 2^{c_1 N} code words of length N) and a decoding method (i.e., a method of suitably
deciphering the received sequences of N signals) that assure a high transmission
rate and at the same time a small probability of error. The most essential
requirement here is that both the coding and, what is particularly difficult, the
decoding must be made comparatively simple in practice. It is not easy to meet
this specification. This persisting difficulty is precisely what has motivated a vast number
of investigations devoted to the development of various practically acceptable
methods of coding and decoding, which even if not optimal (i.e., the best of all
possible) are nevertheless sufficiently good (i.e., allow us to achieve a relatively
high transmission rate without a large probability of error).

For the sake of simplicity, we confine ourselves to only a binary channel, i.e.,
we consider a channel over which we can transmit only two elementary signals,
(say) on-off current, and such that the same two signals are obtained at the
channel output. Denote by the digits 0 and 1 the signals to be used; in such a
case all code words are sequences of these digits, i.e., numbers in the binary
system. Code words of length N must be chosen here from the set containing
all 2^N distinct N-digit binary numbers, i.e., the sequences a_0 a_1 ... a_{N-1}, where all
a_i, i = 0, 1, ..., N - 1, take the value 0 or 1. The collection of all chosen
code words is called a code. If we accept all 2^N distinct N-digit numbers
as code words, then the information transmission rate will be the highest


†The term ‘practically always’ means here that the chosen code can fail only in the highly
improbable case of ‘exceptionally bad luck’. But if N is sufficiently large, then this possibility
can be ignored. Moreover, even in the case of such failure the situation can be saved: if
we are convinced (by means of a transmission test) that the chosen code is bad, it is possible
to simply discard it and choose all code words afresh by means of the same method.

(namely, L bits/unit time, or equivalently L/H letters/unit time), but then there
is no opportunity to determine at the channel output whether transmission
errors have taken place, how many there were, and specifically which signals have been
received in error. If, however, we restrict ourselves to a smaller number of code
words, then the resultant ‘code redundancy’ can be used for the additional transmission
of some information about the distortions induced in the channel. Thus, for
instance, we can use to advantage the simplest method of N-fold repetition
of each elementary signal (i.e., employ as a code only the two simplest code
words 00...0 and 11...1 of length N), and decode a sequence of length N
received at the channel output as 00...0 if it contains more 0’s than 1’s and
as 11...1 otherwise. It is clear that such a transmission method when N is
sufficiently large (and subject to the conventional restriction that the probability
of the distortion of an elementary signal in the process of its transmission is less
than 1/2) assures us of quite a low probability of erroneous decoding of a transmitted
message. However, the transmission speed will also be quite low here
(during the time N/L required for the transmission of N elementary signals,
only 1 bit of information is transmitted, which corresponds to a transmission
rate of L/N bits/unit time = L/(HN) letters/unit time). It is natural that in many
cases we shall not be able to manage with such a low transmission speed. Hence,
the classes of codes intermediate between the two extreme codes considered
above are of greatest interest to us. Such intermediate codes admit a
rather high transmission rate and simultaneously allow us to correct many distortions
in the transmitted message.
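
The trade-off between reliability and speed in the N-fold repetition method can be made concrete with a small calculation. The Python sketch below is an illustration under our own assumptions: the errors in different elementary signals are taken to be independent (a binary symmetric channel), only odd N is handled so that ties cannot occur, and the value p = 0.1 and the function name are chosen purely for the example.

```python
from math import comb

def repetition_error_prob(N, p):
    """Probability that majority decoding of an N-fold repetition fails,
    i.e. that more than half of the N elementary signals are distorted."""
    if N % 2 == 0:
        raise ValueError("this sketch handles odd N only (no ties)")
    q = 1 - p
    return sum(comb(N, k) * p**k * q**(N - k) for k in range(N // 2 + 1, N + 1))

p = 0.1  # hypothetical probability of distortion of one elementary signal
for N in (1, 3, 5, 9):
    # 1/N is the transmission rate in units of L bits/unit time
    print(N, 1 / N, repetition_error_prob(N, p))
```

The error probability falls as N grows, while the rate L/N falls with it, which is exactly the reason for turning to intermediate codes.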

The simplest method of increasing the transmission reliability by a multiple
repetition of each elementary signal allows us to explain some important notions
of coding theory. A code is called an error-detecting code if it permits us to detect
transmission errors, and an error-correcting code if it permits us not only to detect
an error, but also to locate it, i.e., to reconstruct correctly the transmitted
signal. It is clear that error-correcting codes are more useful than error-detecting
codes, but usually the latter are much simpler. If, however,
the probability of error is small, then even the possibility of detecting the error is
of great value. In fact, if it is known that errors are involved in the reception,
then we can simply ignore the obtained message or, if acceptable, require the
transmission to be repeated. It is clear that (say) the triple repetition code, which
codes the elementary signals 0 and 1 as the triplets 000 and 111 and decodes the received
triplets according to the ‘majority rule’ (i.e., for example, 000 and 010 are
decoded as 0 and 110 as 1), allows us to correct any single error (but not a double
error) and to detect any single or double (but not triple) error. Let us suppose
that this code is used for transmission through a binary symmetric channel
sketched in Figure 18. If the error probability p is equal to 0.01 (i.e., 1% of the
transmitted signals are received in error), then our method of coding makes the
probability of error-free deciphering of each triplet as high as 0.9997 (i.e., the
frequency of errors is close to 0.03%). The probability of detecting an error
becomes here close to 0.999999 (i.e., the frequency of a missed error is close to
0.0001%). Of course, the code triples the time of transmission, but nevertheless
it is clear that it can be quite useful when the error probability p is low and the
time for transmission is not of great importance.
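
The two figures quoted for p = 0.01 can be recovered by listing all eight possible patterns of distortion of a triplet. The following Python sketch does this by brute force; it assumes independent errors in the three signals, and the variable names are ours.

```python
from itertools import product

p = 0.01                      # error probability per elementary signal (from the text)
CODE = {(0, 0, 0), (1, 1, 1)}

def majority(triplet):
    return 1 if sum(triplet) >= 2 else 0

sent = (0, 0, 0)              # by symmetry, sending 111 gives the same numbers
p_correct_decode = 0.0        # the majority rule recovers the transmitted signal
p_undetected = 0.0            # errors occurred, yet a code word was received

for errors in product((0, 1), repeat=3):
    k = sum(errors)
    prob = p**k * (1 - p)**(3 - k)
    received = tuple(s ^ e for s, e in zip(sent, errors))
    if majority(received) == sent[0]:
        p_correct_decode += prob
    if k > 0 and received in CODE:
        p_undetected += prob

print(p_correct_decode)       # close to 0.9997, as stated in the text
print(1 - p_undetected)       # close to 0.999999, as stated in the text
```

An error goes unnoticed only when all three signals are distorted, which is why the second probability differs from 1 by just p^3.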

The error-detecting and error-correcting properties of the code considered above
are closely connected to the fact that the code uses only a small part of all
three-term signal sequences as code words and the selected code words differ
here considerably from each other. In general, it is clear that if all code words
have the same length N and any two of them differ in not less than d elementary
signals, then the code permits us to detect in a block of N signals any number of
errors which is less than d. In fact, such a number of errors certainly changes
the transmitted code word into a block of N elementary signals which is not a
code word at all. Moreover, if we decode any received block of N signals as a
code word differing from it in the least number of elementary signals (or one of
such code words, if there are several of them), then we shall correct any number
of errors which is less than d/2. (This is clear since two distinct code words,
both of which differ from a received block in less than d/2 elementary signals,
cannot exist.) In the above example of a code with two code words 000 and
111, evidently, N = 3 and d = 3; hence the code permits us to detect any number
of errors which is less than 3 (i.e., equal to 1 or 2) and to correct the errors
whose number is less than 3/2 (i.e., equal to 1).
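
The rule just stated can be expressed as a short Python sketch. The function names are ours, code words are represented as strings of the digits 0 and 1, and ties in decoding are resolved by taking the first nearest code word found, which is one of the choices the text permits.

```python
from itertools import combinations

def distance(u, v):
    """Number of elementary signals in which two blocks differ."""
    return sum(a != b for a, b in zip(u, v))

def minimum_distance(code):
    """The number d: the least distance between two distinct code words."""
    return min(distance(u, v) for u, v in combinations(code, 2))

def nearest_codeword(received, code):
    """Decode as a code word differing from the block in the fewest signals."""
    return min(code, key=lambda w: distance(received, w))

code = ["000", "111"]
d = minimum_distance(code)
print("d =", d)
print("detects up to", d - 1, "errors; corrects up to", (d - 1) // 2)
print(nearest_codeword("010", code), nearest_codeword("110", code))
```

For integer d, the conditions "less than d" and "less than d/2" of the text correspond to at most d - 1 and at most (d - 1) // 2 errors respectively.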

The multiple repetition method for increasing the transmission reliability is
in fact seldom used since it is quite far from being optimal. The following comparatively
general method of using code word redundancy for the transmission
of information about the distortions is used much more frequently. The
number of code words of length N chosen here is 2^{N-1} (i.e., is equal to half of
the number of all distinct sequences of N binary signals). Let us agree to form
the 2^{N-1} code words from all possible sequences a_0 a_1 ... a_{N-2} of N - 1 digits 0 and
1, with the Nth digit a_{N-1} so chosen every time that the sum a_0 + a_1 + ...
+ a_{N-1} is even. In such a case, the presence of a single error (i.e., when
one of the received N elementary signals is in error) leads to the emergence of
a sequence a'_0 a'_1 ... a'_{N-1} at the output such that the sum a'_0 + a'_1 + ... + a'_{N-1}
is odd (since the only possible distortion is that either 0 is taken for 1, or 1 for 0).
This circumstance enables us to detect easily the presence of a single error, even
though it does not allow us to ascertain what specific signal is received in error
(more precisely, the property of the sum a'_0 + a'_1 + ... + a'_{N-1} being odd indicates
that an odd number of signals has certainly been received in error, but the code
does not permit an even number of errors to be detected). Nevertheless in those
cases for which in the transmission of N signals the probability of the appearance
of more than one error is extremely low, the highly simple coding method
described here is indeed of great value. In fact, if it is known with certainty
that errors are involved in the reception, then we can simply ignore the obtained
message, or if we wish we may require the transmission to be repeated. On the
other hand, the transmission rate in such a coding method still remains quite
high; from the maximal value of L bits/unit time it decreases only to
[(N - 1)/N]L bits/unit time = [(N - 1)/N](L/H) letters/unit time.
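
A minimal Python sketch of this single parity check is given below. The function names and the particular seven information digits are invented for the illustration; digits are held in a list of 0s and 1s.

```python
def add_parity(info_digits):
    """Append the digit a_{N-1} that makes the sum of all N digits even."""
    return info_digits + [sum(info_digits) % 2]

def parity_ok(received):
    """True if the sum of the received digits is even."""
    return sum(received) % 2 == 0

word = add_parity([1, 0, 1, 1, 0, 1, 0])   # N - 1 = 7 digits, so N = 8

single = word.copy()
single[2] ^= 1                              # one signal distorted

double = single.copy()
double[5] ^= 1                              # a second signal distorted

print(word)
print(parity_ok(word), parity_ok(single), parity_ok(double))
```

The check fails for the single error and passes again for the double error, in agreement with the remark that an even number of errors escapes detection.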

The ‘parity check’ method described above can also be applied several times,
and this enables us in many cases not only to detect the presence of an error but
also to correct it. Consider, for instance, the case in which N = 3 and the
number of code words to be employed is 2. It is clear that in such a case it is
reasonable to choose the triples 000 and 111 as code words; such a choice from
the viewpoint of using ‘parity checks’ can be justified as follows. We form two
code words on the basis of the two possible values of the first elementary signal a_0
(i.e., we consider that only the signal a_0 actually contains the information).
Furthermore, we agree to transmit after each ‘information’ signal a_0 two more
‘check’ signals a_1 and a_2 so chosen that both the sums a_0 + a_1 and a_0 + a_2 are
even (it is easy to see that this precisely reduces to the choice of the sequences
000 and 111 as code words). In such a case it is seen that, provided two or three
errors do not occur in the received triple of signals (i.e., if only a correct transmission
and a transmission with a single error are considered possible), then by the
parity check of the sums a'_0 + a'_1 and a'_0 + a'_2 in the triplet a'_0 a'_1 a'_2 received at
the output it can be ascertained without error what specific triplet was actually
transmitted. In fact, if both the sums a'_0 + a'_1 and a'_0 + a'_2 are found to be
even, then it directly implies that there is no transmission error (recall that the
possibility of a double error is excluded). If, however, only one of them is odd,
then this means that the check signal a_1 or a_2 occurring in this sum is received
in error, but if both the sums a'_0 + a'_1 and a'_0 + a'_2 are odd, then this implies
that the information signal a_0 is received in error. Thus, at the price of decreasing
the transmission rate by a factor of 3 (as compared to the maximal rate L
bits/unit time) it is possible to achieve correction of all single errors in triplets of
elementary signals.

The result derived above is obviously trivial (it is clear that by taking the
triplets 000 and 111 as code words, we can achieve correction of all single
errors), but it can be extended also to cases of much larger values of N. Thus,
for instance, if N = 7 and the number of code words is 16 = 2^4, then we can
take the first four signals a_0, a_1, a_2 and a_3 as information signals (since the number
of distinct quadruples a_0 a_1 a_2 a_3 is exactly sixteen), and choose the last three
‘check signals’ a_4, a_5 and a_6 such that the sums


    s_1 = a_0 + a_1 + a_2 + a_4,    s_2 = a_0 + a_1 + a_3 + a_5,
and
    s_3 = a_0 + a_2 + a_3 + a_6


are even. Here the ‘parity check’ of the three sums s_1, s_2 and s_3 at the channel
output also allows us to determine uniquely whether an error has occurred
in reception (subject to the condition that the possibility of two or more errors
in receiving seven signals is ignored) and, if it has, then which signal is affected.
In fact, if one of the 7 signals is received in error, then at least one of the
sums must surely be found to be odd, so that the parity of all three sums positively
indicates that there has been no transmission error. Furthermore, only
one sum will be odd in that (and only that) case in which one of the three ‘check
signals’ (a_4, a_5 or a_6) occurring in the sum is received in error. Finally, the nonparity
of two of the three sums s_1, s_2 and s_3 means that out of a_1, a_2 and a_3 that
signal which occurs in both these odd sums is received in error, and the nonparity
of all three sums implies that the first signal a_0, occurring in all the
sums, is received in error. It is easy to see that the 16 code words of length 7
in the given case have the form


0000000, 1000111, 0100110, 1100001, 
0010101, 1010010, 0110011, 1110100,
0001011, 1001100, 0101101, 1101010, 
0011110, 1011001, 0111000, 1111111. 


The use of these code words yields the transmission rate


    (4/7)L bits/unit time = (4/7)(L/H) letters/unit time,


and at the same time allows us to correct all single errors (but not errors of higher
multiplicity!) in ‘blocks’ of seven elementary signals.
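
The construction of the sixteen code words and the correction of a single error by means of the three parity checks can be sketched in Python as follows. The function names and the table `SYNDROME_TO_POSITION` are ours; positions are numbered 0, 1, ..., 6 as the subscripts of a_0, ..., a_6, and the three sums s_1, s_2, s_3 are exactly those defined above.

```python
from itertools import product

# Subscripts of the signals entering the sums s_1, s_2 and s_3.
CHECKS = [(0, 1, 2, 4), (0, 1, 3, 5), (0, 2, 3, 6)]

def encode(info):
    """Attach check signals a_4, a_5, a_6 making all three sums even."""
    a0, a1, a2, a3 = info
    return [a0, a1, a2, a3,
            (a0 + a1 + a2) % 2, (a0 + a1 + a3) % 2, (a0 + a2 + a3) % 2]

def syndrome(word):
    """Parities of s_1, s_2, s_3 for a received block (1 means odd)."""
    return tuple(sum(word[i] for i in chk) % 2 for chk in CHECKS)

# A distorted signal makes odd exactly those sums in which it occurs,
# and the seven positions give seven different patterns of odd sums.
SYNDROME_TO_POSITION = {
    tuple(int(pos in chk) for chk in CHECKS): pos for pos in range(7)
}

def decode(received):
    """Correct at most one error and return the four information signals."""
    word = list(received)
    s = syndrome(word)
    if s != (0, 0, 0):
        word[SYNDROME_TO_POSITION[s]] ^= 1
    return word[:4]

codewords = ["".join(map(str, encode(info)))
             for info in product((0, 1), repeat=4)]
print(codewords)

garbled = encode((1, 0, 1, 0))
garbled[3] ^= 1                 # distort the signal a_3
print(decode(garbled))          # the information signals 1, 0, 1, 0 are recovered
```

The list printed first contains the same sixteen words as the table above, though in a different order.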

The corresponding code is, of course, not the ‘best possible’ but since both 
coding and decoding are carried out here without much difficulty, it fully just- 
ifies its practical usefulness. Let us consider again, for instance, a binary sym- 
metric channel, in which the probability of receiving in error each of the two 
employed elementary signals is 0.01. The capacity of such a channel is given by 


C = 0.92L bits/unit time 


(see p. 265). Hence, there definitely exists a code that allows us to transmit
nearly 0.92L bits of information per unit time and is such that the probability of a
decoding error is less than an arbitrary preassigned number ε (which can be
chosen as small as desired). However, we do not know how to construct such a code;
furthermore, if ε is taken extremely small, then apparently the code words
of the corresponding code will be very lengthy and the code itself will be extremely
complex. Let us now try to use the very simple code described above with
N = 7, in which to every four signals to be transmitted are added three further
check signals. Here, we transmit information at the rate

    (4/7)L ≈ 0.57L bits/unit time,


which is appreciably lower than the limiting rate of transmission without error;
in addition, the probability of a decoding error here is obviously not ‘as small
as desired’ but is equal to the probability that out of seven transmitted elementary
signals two or more are received in error. Starting from here, it can be
calculated that in such a transmission method in a sequence of ‘elementary information
signals’ slightly less than 0.001 of the signals are obtained in error
at the channel output, so that the probability of receiving one elementary signal
in error is here slightly below 0.001. It is seen that the probability of receiving
one elementary signal in error is reduced in this case to less than 1/10 of that for
transmission not using ‘check signals’. Since in this case both coding and decoding
are highly straightforward and can be easily automated, from a practical
viewpoint the use of the described code unquestionably merits consideration.
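
The numbers in this comparison can be reproduced by direct enumeration. The Python sketch below assumes, as above, a binary symmetric channel with independent errors of probability p = 0.01, uses base-2 logarithms, and relies on the three parity checks of the (7, 4)-code; the names `CHECKS` and `FLIP` are ours. Since the check signals are sums of information signals, it is enough to consider transmission of the all-zero code word, for which the received block coincides with the pattern of errors.

```python
from itertools import product
from math import log2

p = 0.01
q = 1 - p
CHECKS = [(0, 1, 2, 4), (0, 1, 3, 5), (0, 2, 3, 6)]
# Pattern of odd sums -> position that the decoder will invert.
FLIP = {tuple(int(i in c) for c in CHECKS): i for i in range(7)}

capacity = 1 + p * log2(p) + q * log2(q)   # in units of L bits/unit time
rate = 4 / 7                               # in the same units
block_error = 1 - q**7 - 7 * p * q**6      # two or more errors among seven signals

bit_error = 0.0                            # fraction of wrong information signals
for pattern in product((0, 1), repeat=7):
    k = sum(pattern)
    word = list(pattern)
    s = tuple(sum(word[i] for i in c) % 2 for c in CHECKS)
    if s != (0, 0, 0):
        word[FLIP[s]] ^= 1
    wrong_info = sum(word[:4])             # information signals still in error
    bit_error += p**k * q**(7 - k) * wrong_info / 4

print(round(capacity, 2), round(rate, 2))
print(block_error, bit_error)
```

The last number printed is the fraction of information signals delivered in error after correction, to be compared with the value 0.01 obtained without check signals.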
It may be noted further that the examples described here of ‘single-error- 
correcting codes’ are quite intimately related to the content of the problem anal- 
yzed on pp. 107-108, in which it was supposed that among n given numbers either 
one number or none was thought of and it was required by means of the least 
number of questions (answers to which could be only ‘yes’ or ‘no’) to clarify 
whether or not a number was thought of and if yes, what number specifically. 
It is now convenient that instead of n numbers we consider N indices 0, 1, ...,
N - 1 appearing as subscripts in the code word a_0 a_1 ... a_{N-1}; such a substitution
obviously does not affect our arguments. By what is stated in our exposition
on p. 108, it is required here to put not less than log (N + 1) and not
more than log (N + 1) + 1 questions; but our ‘parity checks’ are in fact equivalent
to some questions (since each check can give two results: ‘even’ or ‘odd’,
in analogy to ‘yes’ or ‘no’ answers to a question). In Chap. 3 answers to the 
questions contain definite information about the number thought of, since these 
were put by a person to whom this number was known. Similarly, in order 
that the result of a ‘parity check’ contains information about the possible distor- 
tion in transmission, it is necessary to know in advance that the sum of the 
signals to be transmitted is even or odd. Since in general it cannot be known 
what signal is transmitted, the preceding condition can be satisfied if and only 
if each sum to be transmitted contains at least one ‘check signal’, which we agree 
beforehand to choose such that the corresponding sum is found to be (say) even. 
It is thus clear that the number of ‘check signals’ that must be added coincides 
with the minimal number of ‘parity checks’ needed, i.e., it is equal to the number
of those questions of which we spoke on p. 108. If, for instance, N = 3,
then the number of questions cannot be less than log (3 + 1) = log 4 = 2; this
also corresponds exactly to the fact that in the example of the single-error-correcting
code described on p. 310, each ‘information signal’ a_0 to be transmitted
is followed by two additional ‘check signals’ a_1 and a_2. We further note that
since the signals a_1 and a_2 are so chosen that the sums a_0 + a_1 and a_0 + a_2 are
even, the parity checks of the corresponding sums at the channel output are
equivalent to the answers to the questions of ‘whether or not the pair of received
signals a_0 and a_1 contains an error’ and of ‘whether or not the pair of signals
a_0 and a_2 contains an error’. It is clear that answers to such questions
allow us to determine uniquely any single error. In analogy to this, if N = 7,
then the number of required questions (i.e., ‘parity checks’ and ‘check signals’)
cannot be less than log (7 + 1) = log 8 = 3; this is exactly what is shown on
pp. 310-311. Taking recourse there to the parity check of the sums s_1, s_2 and s_3 is
equivalent to the answers to the questions of ‘whether or not the received signals
a_0, a_1, a_2 and a_4 contain an error?’, ‘whether or not the signals a_0, a_1, a_3
and a_5 contain an error?’ and ‘whether or not the signals a_0, a_2, a_3 and a_6 contain
an error?’. It is obvious that answers to these questions also uniquely
determine the erroneous signal if it exists.

In the general case of code words of length N, the number K of ‘check signals’
of a code needed to correct all single errors must satisfy, by what is stated
above, the inequality


    log (N + 1) ≤ K < log (N + 1) + 1,

so that


    2^{K-1} - 1 < N ≤ 2^K - 1.


The number of ‘information signals’ here is then equal to N - K. A code that
uses code words of length N which consist of M = N - K ‘information signals’
and K ‘check signals’, carrying no information but used for parity checks, we
call an (N, M)-code.† The information transmission rate associated with such a
code is obviously L(M/N) bits/unit time = L(1 - K/N) bits/unit time. In the
case considered K < log (N+ 1) + 1, so that K for large N is considerably 
smaller than N; hence the transmission rate for large N is here quite close to 
the maximal rate of L bits/unit time. We see that the code under consideration 
when N is large assures a quite high transmission rate. Obviously, it is never- 
theless not preferable to choose an extremely large N, because in that case the 
probability of the presence of several (more than one) errors in a block of N 
signals is sharply increased, i.e., the reliability of the code is reduced. In prac- 
tice we have to resort to a compromise and choose some intermediate (neither 


†Thus, for example, neither the Shannon-Fano code nor the Huffman code is an
(N, M)-code, but the triple repetition code described above is a (3, 1)-code.
General (N, M)-codes are often also called block codes. It is clear that N > M
for all error-detecting and error-correcting block codes. However, the case
N = M is widely used in cryptography, where the coding serves only to make the
message unintelligible to the uninitiated.

exceedingly large, nor yet too small) value of N. Also, the method of choosing
the ‘check signals’ for a general (N, M)-code, where M = N − K, correcting all
single errors, can be set up along the path indicated on p. 108 for the problem
of guessing a thought-of number; we shall not dwell upon this here, because we
are going to indicate below an entirely different method for the construction
of the required code. We may further remark that the case of a
single-error-correcting (7, 4)-code analyzed on pp. 310-311 was considered by
Shannon [21] by way of an example; the general single-error-correcting
(N, M)-codes were examined in 1950 by Hamming [203] and are usually called the
Hamming codes in the literature.†
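
The compromise between block length and rate described above can be made concrete with a small calculation. The following is only an illustrative sketch: the function names `min_check_signals` and `relative_rate` are hypothetical and do not come from the text, and the logarithm is taken to base 2, as in the inequality log (N + 1) ≤ K.

```python
def min_check_signals(n_block: int) -> int:
    """Smallest K with 2**K >= n_block + 1, i.e. K >= log2(N + 1)."""
    k = 0
    while 2 ** k < n_block + 1:
        k += 1
    return k


def relative_rate(n_block: int) -> float:
    """Fraction M/N = 1 - K/N of the signals that carry information."""
    return (n_block - min_check_signals(n_block)) / n_block


if __name__ == "__main__":
    # The rate approaches 1 as N grows, but so does the chance of
    # more than one error falling inside a single block.
    for n_block in (3, 7, 15, 31, 127, 1023):
        k = min_check_signals(n_block)
        print(f"N={n_block:5d}  K={k:2d}  M={n_block - k:5d}  "
              f"M/N={relative_rate(n_block):.3f}")
```

For N = 3 the sketch gives K = 2 and M = 1, in agreement with the (3, 1) triple repetition code, and for N = 7 it gives the (7, 4)-code.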

Similarly, we can also approach the problem of constructing a
double-error-correcting code, i.e., one which corrects all single and double
errors. Assume (say) that N = 5, and that we ignore the possibility of the
simultaneous distortion of more than two of the five signals but require that
the code enable us to correct all distortions whose number does not exceed two.
This setup leads us to the problem of determining n ≤ 2 thought-of numbers
among some five numbers. By what was stated on p. 108, determining these
numbers requires posing not less than


$$\log\left[\binom{5}{2} + \binom{5}{1} + 1\right] = \log (10 + 5 + 1) = \log 16 = 4$$


questions; hence at least four parity checks must be carried out, which
implies that of every five signals $a_0$, $a_1$, $a_2$, $a_3$ and $a_4$ at
least four must be ‘check signals’. It is not difficult to see that in the
given case four check signals indeed suffice for solving the problem. It is,
for example, possible to choose as these signals $a_1$, $a_2$, $a_3$ and $a_4$
with the restriction that the sums


$$s_1 = a_0 + a_1, \quad s_2 = a_0 + a_2, \quad s_3 = a_0 + a_3 \quad \text{and} \quad s_4 = a_0 + a_4$$


be even. In such a case the parity (evenness) of all sums considered at the
receiving end of the channel means the absence of errors; the nonparity of one
sum $s_i$ implies only the corresponding signal $a_i$ to be in error; the
nonparity of two sums $s_i$ and $s_j$, the signals $a_i$ and $a_j$ to be in
error; the nonparity of three sums, say all except $s_i$, the signals $a_0$
and $a_i$ to be in error; the nonparity of all four sums, only the


†It is, however, quite common to refer only to those single-error-correcting
(N, M)-codes in which $N = 2^K - 1$ as the Hamming codes (i.e., those in which
N takes the greatest value possible for a given number K of ‘check signals’).
These codes possess significant properties, which we shall state at the close
of this section. It is interesting to note that such
$(2^K - 1,\ 2^K - K - 1)$-codes were examined as early as 1942 (i.e., prior to
the appearance of Hamming’s and even Shannon’s works) by the famous English
statistician R. A. Fisher (see Berlekamp [190], Section 1.3) in an entirely
different context (not formally connected to coding theory but equivalent to
it).




signal $a_0$ to be in error.†
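
The five-signal example above can be checked mechanically. The sketch below is an illustration only; the helper names `syndrome`, `error_patterns` and `correct` are assumptions of this example and not part of the text. It computes the four sums s_1, ..., s_4 for every pattern of at most two errors and confirms that the 16 patterns give 16 different results, so that each pattern can be recognized and undone.

```python
from itertools import combinations


def syndrome(block):
    """Parities of the sums s_i = a_0 + a_i for i = 1, 2, 3, 4."""
    return tuple((block[0] + block[i]) % 2 for i in range(1, 5))


def error_patterns(max_errors=2, length=5):
    """All 0/1 blocks of the given length with at most max_errors ones."""
    for weight in range(max_errors + 1):
        for positions in combinations(range(length), weight):
            yield tuple(1 if i in positions else 0 for i in range(length))


# Every code word has all four sums even, so the sums computed from a
# received block depend only on the error pattern.
TABLE = {syndrome(e): e for e in error_patterns()}


def correct(received):
    """Undo the (at most double) error indicated by the four parity checks."""
    e = TABLE[syndrome(received)]
    return tuple((r + x) % 2 for r, x in zip(received, e))
```

With these definitions `correct((1, 0, 1, 1, 0))` restores the code word of five 1’s: the sums s_1 and s_4 are odd, which points to errors in a_1 and a_4.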

In the general case of a double-error-correcting code with an arbitrary number
N of signals in every code word, the results derived on p. 108 show in exactly
the same way that the number K of ‘check signals’ and of the ‘parity checks’
corresponding to them must satisfy the inequality


$$K \ge \log\left[\binom{N}{2} + \binom{N}{1} + 1\right] = \log \frac{N^2 + N + 2}{2}. \qquad (*)$$


However, the question as to which specific ‘check signals’ should be chosen here
(i.e., which ‘parity checks’ take us fastest to our goal) is not easy to answer
in this case, and thus a solution of the corresponding problem of guessing a
number does not yield a general method of effectively constructing a suitable
‘error-correcting code’. Analogously, in the still more general case of codes
that enable us to detect and correct, in a sequence of signals of length N, any
number of errors not exceeding a given n, the reasoning of p. 108 allows us to
say that the number K of ‘check signals’ (and of the ‘parity checks’
corresponding to them) required for this sequence must satisfy the inequality


$$K \ge \log\left[\binom{N}{n} + \binom{N}{n-1} + \cdots + \binom{N}{1} + 1\right]. \qquad (**)$$


This straightforward conclusion is due to Hamming [203], and hence inequality
(**) for the number K is frequently called the Hamming inequality or the
Hamming lower bound on the number of ‘check signals’ of an n-error-correcting
code. If n = 1, then the Hamming inequality (**) reduces to the result
$N \le 2^K - 1$ already known to us; here equality is attained for the Hamming
codes with $N = 2^K - 1$. But the arguments set forth on pp. 107-108 do not
indicate, in the general case, how we should choose the ‘parity checks’ we need
(i.e., how to construct a code with the requisite properties); furthermore,
they do not even allow us to state that for any K satisfying the Hamming
inequality (**) there indeed exists a ‘parity-check code’ that contains K check
signals and enables us to correct any number of errors not exceeding n in a
‘block’ of N signals (in fact, for certain K satisfying this inequality, it is
impossible to construct the requisite code). An estimate of the number K of
‘check signals’ that is clearly sufficient for it to


†It is easy to see that the ‘parity checks’ described are equivalent to the
answers to the questions: ‘is the number of errors even among the received
signals $a_0$ and $a_1$?’; ‘among the received signals $a_0$ and $a_2$?’;
‘among the received signals $a_0$ and $a_3$?’; and finally, ‘among the received
signals $a_0$ and $a_4$?’ Here the answer to the first question separates from
the 16 distinct possible ‘outcomes’ of the transmission, in which not more than
two elementary signals are distorted, a group of 8 admissible outcomes, i.e.,
it contains the largest possible information; in the same way, each succeeding
question also extracts exactly half of the remaining number of these possible
‘outcomes’.

be possible to detect and correct any number of errors not exceeding n in a
block of N signals, obtained from completely different arguments, is due to
Varshamov [218], who showed that for


$$K > \log\left[\binom{N-1}{2n-1} + \binom{N-1}{2n-2} + \cdots + \binom{N-1}{1} + 1\right] \qquad (***)$$


we can always construct a ‘parity-check code’ having the requisite properties.
The inequality (***) (which sharpens the preceding related result due to
Gilbert [201]) is called the Varshamov-Gilbert inequality or the
Varshamov-Gilbert upper bound on the number K of check signals of an
n-error-correcting code; a simple proof of this will be given later in this
section. If n > 1, the Varshamov-Gilbert upper bound is in general found to be
greater than the Hamming lower bound. Thus there are values of the number K of
‘check signals’ for which the corresponding inequalities do not exclude the
possibility that an n-error-correcting (N, N − K)-code exists, but at the same
time do not allow us to assert that such a code necessarily exists. In
addition, all proofs of the Varshamov-Gilbert inequality, although relying on a
definite method of constructing the required code, make no claim that this
method can be successfully applied in practice. In fact, the existing
constructive proofs are completely unsuitable for actual use (they are all
based on sorting through an enormous number of possibilities).
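
The gap between the two bounds is easy to tabulate. The sketch below is hedged in two ways: the function names are hypothetical, and the Varshamov condition is taken in its standard form, namely that K check signals suffice as soon as 2^K exceeds the sum of the binomial coefficients C(N − 1, i) for i = 0, 1, ..., 2n − 1. The Hamming bound is taken as 2^K ≥ C(N, 0) + C(N, 1) + ... + C(N, n).

```python
from math import comb


def hamming_lower_bound(n_block: int, n_err: int) -> int:
    """Smallest K allowed by the Hamming inequality (necessary condition)."""
    total = sum(comb(n_block, i) for i in range(n_err + 1))
    k = 0
    while 2 ** k < total:
        k += 1
    return k


def varshamov_sufficient(n_block: int, n_err: int) -> int:
    """Smallest K guaranteed to suffice by the Varshamov-type condition."""
    total = sum(comb(n_block - 1, i) for i in range(2 * n_err))  # i = 0 .. 2n-1
    k = 0
    while 2 ** k <= total:  # strict inequality 2**K > total is required
        k += 1
    return k


if __name__ == "__main__":
    for n_block, n_err in ((7, 1), (5, 2), (15, 2), (31, 2), (31, 3)):
        print(n_block, n_err,
              hamming_lower_bound(n_block, n_err),
              varshamov_sufficient(n_block, n_err))
```

For n = 1 and N = 7 both functions return 3, while for n = 2 and N = 15 they return 7 and 9; values of K between the two are exactly those for which neither inequality settles whether a code exists.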

Even for the simplest case in which n = 2, a practicable method of constructing
a ‘parity-check code’ that allows the correction of any single or double error
in a block of an arbitrary number N of signals was not found until nearly ten
years after the appearance of Hamming’s work [203] describing a general
single-error-correcting code. In this connection, the reader is referred to
Bose and Ray-Chaudhuri [192] and Hocquenghem [204] where, surprisingly, the
tools used for this purpose are found to belong to a subtle and quite
complicated mathematical apparatus involving abstract algebra. We shall return
to Bose-Chaudhuri-Hocquenghem codes at the end of this section. A subsequent
generalization of these methods, allowing us to construct codes correcting any
number of errors not exceeding a given number n, proved to be comparatively
simple and was obtained at practically the same time as the codes correcting
not more than two errors.


In order to give an idea of the method of constructing codes that correct, by
means of parity checks, not only single but also double (or generally multiple,
not exceeding a given multiplicity) errors, the first prerequisite is to define
rigorously the notion of a ‘parity-check code’. For this purpose, a convenient
starting point is to regard all arithmetic operations with the numbers 0 and 1
as operations that can have only two possible results, 0 and 1, symbolizing the
fact that as a result of the operations we obtain an even number and an odd
number, respectively. This leads us to the accompanying table listing the
results of all possible arithmetic operations carried out on the numbers 0
and 1:


$$0 + 0 = 0, \quad 0 + 1 = 1, \quad 1 + 0 = 1, \quad 1 + 1 = 0;$$
$$0 \times 0 = 0, \quad 0 \times 1 = 0, \quad 1 \times 0 = 0, \quad 1 \times 1 = 1.$$


It is easy to see that the operations so obtained are ‘addition’ and
‘multiplication’ (which we call addition and multiplication in an arithmetic
with two symbols, or, in short, in 2-arithmetic), satisfying all rules of
ordinary arithmetic.† This fact is expressed by saying that the collection of
the two numbers 0 and 1, for which the addition and multiplication conventions
of 2-arithmetic are defined, forms a field of two elements (or a binary field;
cf. Appendix II).‡

It is now easy to describe a general (N, M)-parity-check code. It is
characterized by K = N − M relations of the form


$$\begin{aligned}
a_M &= b_{M,0}a_0 + b_{M,1}a_1 + \cdots + b_{M,M-1}a_{M-1},\\
a_{M+1} &= b_{M+1,0}a_0 + b_{M+1,1}a_1 + \cdots + b_{M+1,M-1}a_{M-1},\\
&\;\;\vdots\\
a_{N-1} &= b_{N-1,0}a_0 + b_{N-1,1}a_1 + \cdots + b_{N-1,M-1}a_{M-1}.
\end{aligned} \qquad (1)$$


Here all coefficients


$$b_{M,0},\ b_{M,1},\ \ldots,\ b_{M,M-1};\ \ldots;\ b_{N-1,0},\ \ldots,\ b_{N-1,M-1}$$


are elements of our field of two elements (i.e., the number 0 or 1), and all
arithmetic operations appearing in these equations are understood in the
2-arithmetic sense (so that each equation means only that its left-hand and
right-hand sides


†Strictly speaking, ‘multiplication’ in 2-arithmetic can be written without
quotation marks since it does not differ from customary multiplication. By
contrast, ‘addition’ in 2-arithmetic differs from ordinary addition because
here 1 + 1 = 0 (for which reason this addition is sometimes denoted by a
special symbol, e.g., ⊕).

‡The fact that a collection of distinct elementary signals can be considered as
the collection of all possible elements of a certain finite field is of great
importance for all of modern algebraic coding theory. However, in algebra it is
demonstrated that a field with a given number m of distinct elements exists if
and only if m is a power of a prime number (i.e., equals $p^k$, where p is
prime; see Appendix II). Hence, algebraic coding theory can be applied directly
to a non-binary linear channel (which we shall not consider here at all,
however) only in the case in which the number m of distinct elementary signals
that can be transmitted over the channel has the form $p^k$. If this is not so,
then we have to resort to some tricks (for instance, we might never use some of
the admissible signals).

taken in the ordinary sense have the same parity). The K parity checks
corresponding to our (N, M)-code are parity checks of the sums of the check
signals $a_i$ (where i takes the K = N − M values M, M + 1, ...,
M + K − 1 = N − 1) and those of the information signals
$a_0, a_1, \ldots, a_{M-1}$ which correspond to coefficients
$b_{i,0}, b_{i,1}, \ldots, b_{i,M-1}$ equal to 1 (but not 0!).† For defining a
code, it suffices to indicate all coefficients $b_{i,j}$ entering the equations
set forth. It is appropriate here that all the left-hand sides
$a_M, a_{M+1}, \ldots, a_{N-1}$ of these equations be transferred first to the
right-hand sides (taking account of the footnote on this page), and then that
all coefficients in the obtained equations be arranged in the form of a table
of K = N − M rows and N columns, at the intersection of whose ith row and jth
column appears the coefficient of $a_j$ in the ith of our equations. It is easy
to see that such a table has the form


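$$\begin{pmatrix}
b_{M,0} & b_{M,1} & \cdots & b_{M,M-1} & 1 & 0 & \cdots & 0\\
b_{M+1,0} & b_{M+1,1} & \cdots & b_{M+1,M-1} & 0 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
b_{N-1,0} & b_{N-1,1} & \cdots & b_{N-1,M-1} & 0 & 0 & \cdots & 1
\end{pmatrix} \qquad (2)$$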


A rectangular array of m rows and n columns is called in mathematics a matrix
of m rows and n columns, or briefly an (m × n)-matrix; thus, a general
(N, M)-parity-check code is given by a (K × N)-matrix of 0’s and 1’s of the
specific form (2). The collection of all possible code words of such a general
(N, M)-parity-check code can easily be described thus: the information signals
$a_0, a_1, \ldots, a_{M-1}$ can be arbitrary here (i.e., each of them can take,
regardless of the others, either of the values 0 and 1), but the check signals
$a_M, a_{M+1}, \ldots, a_{N-1}$ are uniquely defined by the information signals
with the aid of equations (1), understood in the sense of 2-arithmetic. The
total number of distinct code words in this case is obviously
$2^M = 2^{N-K}$.
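
To see relations (1) and the table (2) at work, here is a minimal sketch for a (7, 4)-code. The particular coefficients in `COEFFS` are a hypothetical choice made only for illustration, as are the function names; any other table of 0’s and 1’s with K rows and M columns would define a systematic code in the same way.

```python
from itertools import product

# Hypothetical coefficients b[i][j]: row i gives the check signal a_{M+i}
# as a 2-arithmetic sum of the information signals a_0, ..., a_{M-1}.
COEFFS = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]


def encode(info, coeffs=COEFFS):
    """Append the K check signals defined by relations (1) to the M info signals."""
    checks = [sum(b * a for b, a in zip(row, info)) % 2 for row in coeffs]
    return list(info) + checks


def check_matrix(coeffs=COEFFS):
    """Table (2): the coefficient rows followed by a K x K identity block."""
    k = len(coeffs)
    return [list(row) + [1 if c == r else 0 for c in range(k)]
            for r, row in enumerate(coeffs)]


def satisfies_checks(word, coeffs=COEFFS):
    """True if every row of the check matrix gives an even sum on the word."""
    return all(sum(b * a for b, a in zip(row, word)) % 2 == 0
               for row in check_matrix(coeffs))


CODEWORDS = [encode(info) for info in product((0, 1), repeat=4)]
```

The list `CODEWORDS` has 2^4 = 16 entries, one for each choice of the four information signals, and each of them passes all three parity checks.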

†Recall that in 2-arithmetic 1 + 1 = 0 and hence −1 = 1. Therefore, when a term
is carried over here from one side of an equation to the other, it need not
change its sign, and the equation x = y can be written both as x − y = 0 and as
x + y = 0 (both relations are equivalent to each other, implying just that x
and y have the same parity).

Note that sometimes a parity-check code is also defined rather more broadly
as a collection of N-term sequences $a_0 a_1 \ldots a_{N-1}$ of symbols 0 and 1
such that the numbers $a_0, a_1, \ldots, a_{N-1}$ satisfy K relations of the
form


$$\begin{aligned}
b_{M,0}a_0 + b_{M,1}a_1 + \cdots + b_{M,N-1}a_{N-1} &= 0,\\
b_{M+1,0}a_0 + b_{M+1,1}a_1 + \cdots + b_{M+1,N-1}a_{N-1} &= 0,\\
&\;\;\vdots\\
b_{N-1,0}a_0 + b_{N-1,1}a_1 + \cdots + b_{N-1,N-1}a_{N-1} &= 0
\end{aligned} \qquad (1')$$


(where the coefficients again take only the values 0 and 1, and the equations
are understood in the sense of 2-arithmetic). The matrix corresponding to the
most general code (1’) is an arbitrary (K × N)-matrix consisting of 0’s and
1’s. Bearing in mind this broader definition, a particular code given by
relations of the form (1) and matrices of the form (2) is called a systematic
parity-check code. It is not difficult to show, however, that an arbitrary
parity-check code can always be written as a systematic code with the number of
‘check signals’ not exceeding the number K of relations (1’) (see Appendix II).
Hence, as a rule, in the following we shall speak only of systematic codes.

In the literature on coding theory, parity-check codes are also often called
linear codes or group codes. Both these terms are related to additional
properties of the considered codes, which are of interest in their own right
and highly important if it is desired to carry over the theory of such codes to
more general nonbinary channels (for which the notion of a parity check
obviously has no direct meaning). In order to clarify what these properties
consist of, it is necessary to consider the operations of addition and of
multiplication by a number z (belonging to our field of two elements, i.e.,
equal to either 0 or 1) for N-tuple blocks $a = (a_0, a_1, \ldots, a_{N-1})$ of
0’s and 1’s. These operations are defined in the conventional way as follows:


$$(a_0, a_1, \ldots, a_{N-1}) + (a'_0, a'_1, \ldots, a'_{N-1}) = (a_0 + a'_0,\ a_1 + a'_1,\ \ldots,\ a_{N-1} + a'_{N-1}),$$
$$z \times (a_0, a_1, \ldots, a_{N-1}) = (z a_0,\ z a_1,\ \ldots,\ z a_{N-1}).$$


We note incidentally that since all arithmetic operations are understood here
in the sense of 2-arithmetic, the operation of multiplication of a block by a
number is of no particular interest; for any block
$a = (a_0, a_1, \ldots, a_{N-1})$ evidently $0 \times a = \mathbf{0}$ and
$1 \times a = a$, where $\mathbf{0} = (0, 0, \ldots, 0)$ is the zero block
formed of N zeros.

It is not difficult to verify that the operations of addition and
multiplication by a number so defined satisfy all the basic rules of ordinary
arithmetic operations. To express this fact in the language of modern algebra,
we say that the collection of all possible N-tuple sequences
$a = (a_0, a_1, \ldots, a_{N-1})$ of 0’s and 1’s forms a vector space (a precise
definition of a vector space is given in Appendix II). On the other hand, the
fact that the operation of adding two sequences has by itself (i.e., unrelated
to multiplication by a number) the conventional properties of the arithmetic
operation of addition can be expressed by saying that the collection of
sequences $a = (a_0, a_1, \ldots, a_{N-1})$ is a group under the addition
operation introduced above (the definition of a group is given in Appendix II).
A code (i.e., a definite collection of code words, each of which is a ‘block’,
i.e., an N-tuple sequence of 0’s and 1’s) is called linear if its code words
form a linear subspace of the common vector space of all such ‘blocks’, which
means that the sum of any two code words of a linear code and also the product
of a code word by a number z must be a code word.† A code is called a group
code if its code words form a subgroup of the common group of sequences
$(a_0, a_1, \ldots, a_{N-1})$; in the binary case considered here, this again
means just that the sum of any two code words, and also the ‘zero block’
(0, 0, ..., 0), must be a code word (see also Appendix II). Thus, it is seen
that in the case of a binary channel (i.e., the case in which only two
elementary signals are used), both the terms linear code and group code mean
one and the same thing.‡

Consider now an arbitrary (not necessarily systematic) parity-check code,
whose code words coincide with the collection of sequences
$a = (a_0, a_1, \ldots, a_{N-1})$ for which relations (1’) are satisfied. In
the first place, it is obvious that if $(a_0, a_1, \ldots, a_{N-1})$ is a block
of 0’s alone, then relations (1’) are necessarily satisfied; hence, the zero
block (0, 0, ..., 0) is surely a code word of the considered code. Moreover, if
the blocks


$$a = (a_0, a_1, \ldots, a_{N-1})$$


and


$$a' = (a'_0, a'_1, \ldots, a'_{N-1})$$


are both code words (i.e., all K relations (1’) are satisfied for both of them),
then adding to each other the first, second, ..., last of these relations for a
and a’ we verify that


$$a + a' = (a_0 + a'_0,\ a_1 + a'_1,\ \ldots,\ a_{N-1} + a'_{N-1})$$


†It is clear that in the case considered, in which only two elementary signals
are admissible, the condition related to multiplication by a number is rather
trivial; it means only that the sequence (0, 0, ..., 0) of N 0’s must be a code
word. However, in the case of more than two elementary signals, the indicated
condition turns out to be quite important.

‡In the more general case of a communication channel with m elementary signals,
these two notions are equivalent to each other if m = p is a prime number, but
the notion of a linear code is only a particular case of the notion of a group
code if $m = p^k$, where p is a prime and k > 1 (see footnote ‡ on p. 317).
Finally, if m is not an integral power of some prime number, then neither of
these notions can in general be defined.

also satisfies all relations (1’), i.e., is also a code word. This implies that
every parity-check code is simultaneously also a linear (group) code. On the
other hand, in algebra it is shown that any linear subspace of the vector space
of sequences $a = (a_0, a_1, \ldots, a_{N-1})$ can be defined by a certain
collection of relations of the form (1’) (see Appendix II). Consequently, the
class of linear (or group) codes for a binary channel coincides precisely with
the class of parity-check codes; this fact provides the justification for also
calling parity-check codes linear codes or group codes.
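
The closure property just proved can also be confirmed by brute force on a small example. The sketch below is only an illustration under stated assumptions: the (3 × 7) check matrix `B` is a hypothetical example, and the names `is_codeword` and `add_blocks` are not taken from the text.

```python
from itertools import product

# Hypothetical (3 x 7) check matrix of 0's and 1's, chosen for illustration.
B = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def is_codeword(block, matrix=B):
    """True if the block satisfies every relation of the form (1')."""
    return all(sum(b * a for b, a in zip(row, block)) % 2 == 0
               for row in matrix)


def add_blocks(x, y):
    """Termwise addition of two blocks in 2-arithmetic."""
    return tuple((p + q) % 2 for p, q in zip(x, y))


# All blocks of length 7 that pass the three parity checks.
CODE = {w for w in product((0, 1), repeat=7) if is_codeword(w)}

# The sum of any two code words must again be a code word.
CLOSED = all(add_blocks(x, y) in CODE for x in CODE for y in CODE)
```

Here `CODE` contains 16 blocks, including the zero block, and `CLOSED` evaluates to `True`, as the argument above requires.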

Let us return to the consideration of general parity-check codes. It has
already been remarked above that every such code can be represented in the form
of a systematic code (satisfying relations of the form (1)); therefore, we
shall deal here mainly with codes of the latter type. Such a code is defined by
matrix (2), which is usually called the check matrix of the code†; for
convenience let us denote it by a single letter B. If
$a = (a_0, a_1, \ldots, a_{N-1})$ is one of the code words of our code, the
validity of relations (1) for it is conveniently represented by the single
vector equation


$$Ba = 0. \qquad (3)$$


The left-hand side of (3) serves as the symbolic notation of the N − M entries
of the left-hand sides of equations of the form (1’) obtained from (1) by
transferring all left-hand sides to the right-hand sides; here Ba is the
product of the matrix B and the vector a, understood in the sense of matrix
calculus (which is further dealt with in Appendix II). Suppose that the code
word $a = (a_0, a_1, \ldots, a_{N-1})$ is transmitted through a communication
channel; as a result of distortion in the transmission process, a sequence
$a' = (a'_0, a'_1, \ldots, a'_{N-1})$ other than the one transmitted is in
general obtained at the output. Substitute the sequence a’ in the left-hand
sides of equations (1’) (understood, as usual, in the sense of 2-arithmetic)
and denote by the symbol Ba’ the resultant K = N − M numbers 0 and 1
(representing a K-term sequence $(s_M, s_{M+1}, \ldots, s_{N-1})$). Since a’ is
in general not a code word, $Ba' = s = (s_M, s_{M+1}, \ldots, s_{N-1})$ is not
a zero sequence (i.e., it contains 1’s at certain places). The presence of
these 1’s obviously shows that distortion has occurred during transmission; in
the language used previously, each 1 implies that the corresponding ‘parity
check’ has led to a negative result. Let


$$e = (e_0, e_1, \ldots, e_{N-1}) = (a'_0 - a_0,\ a'_1 - a_1,\ \ldots,\ a'_{N-1} - a_{N-1})$$


be an N-term ‘error block’, containing 1’s at places corresponding to the
signals $a_i$ distorted during transmission and 0’s at all the remaining
places, so that

$$e = a' - a = a' + a$$

(recall that in 2-arithmetic a − b = a + b). It is clear that by (3), we have

$$Be = B(a' - a) = Ba' - Ba = Ba'.$$

Consequently,

$$Be = s. \qquad (4)$$

†In the case of general (not necessarily systematic) parity-check codes, a
check matrix is obviously an arbitrary (K × N)-matrix of 0’s and 1’s (some
examples of such general check matrices will be given later).


Unfortunately, in general, there exist many sequences
$e = (e_0, e_1, \ldots, e_{N-1})$ that satisfy the N − M relations (4).
Therefore, starting from (4) alone it is still not possible to reconstruct the
‘error block’ e (and hence also the transmitted sequence
$a = a' - e = a' + e$). When decoding a parity-check code, it is usually
assumed that the probability of distortion in transmission of each signal is
smaller than the probability of correct transmission. In agreement with this
the following decoding rule is set up: as the error block e we take that
sequence satisfying equation (4) which contains the least number of 1’s, i.e.,
corresponds to the least possible number of distortions in transmission (if
among the sequences satisfying (4) there are several sequences containing one
and the same least number of 1’s, then e is chosen randomly from them). This
rule allows us to decode all N-term sequences of elementary signals received at
the channel output, i.e., it associates with each of them a definite code word
a = a’ + e (condition (3) is obviously satisfied by a, i.e., a is in fact a
code word). This code word a is then considered to have been transmitted over
the communication channel.
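
The decoding rule just stated can be written down directly as an exhaustive search. The following is a minimal sketch and not a practical decoder: the check matrix `B` is a hypothetical (7, 4) example, the function names are assumptions of this illustration, and ties between error blocks of equal weight are broken by taking the first one found rather than by a random choice.

```python
from itertools import combinations

# Hypothetical check matrix of a systematic (7, 4)-code, for illustration only.
B = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def syndrome(matrix, block):
    """The K parities s = B * block, computed in 2-arithmetic."""
    return [sum(b * x for b, x in zip(row, block)) % 2 for row in matrix]


def least_weight_error(matrix, s):
    """Error block e with Be = s containing the least number of 1's."""
    n = len(matrix[0])
    for weight in range(n + 1):          # try 0 errors, then 1, then 2, ...
        for positions in combinations(range(n), weight):
            e = [1 if j in positions else 0 for j in range(n)]
            if syndrome(matrix, e) == s:
                return e
    raise ValueError("no error block produces this syndrome")


def decode(matrix, received):
    """Return the code word a = a' + e selected by the least-weight rule."""
    e = least_weight_error(matrix, syndrome(matrix, received))
    return [(r + x) % 2 for r, x in zip(received, e)]
```

For example, `decode(B, [1, 0, 1, 0, 1, 1, 0])` returns `[1, 0, 0, 0, 1, 1, 0]`, undoing a single distortion of the third signal. The nested loops examine up to 2^N candidate error blocks, which is why a search of this kind quickly becomes impractical as the block length grows.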

The described method of decoding a parity-check code is appreciably simpler
than the general method set forth on p. 293 (and based on the consideration of
the groups Y corresponding to distinct code words). Nonetheless even this is
not practically suitable: for large values of K = N − M, finding a sequence
satisfying (4) that contains the least number of 1’s turns out to be so
laborious that even modern computers are unable to accomplish it within a
tolerable time. Hence, the problem of developing sufficiently simple (i.e.,
attainable in practice) methods for finding the required block e is quite
important; for the present it can be regarded as solved only for some
particular cases of codes whose check matrices B have a highly special
structure.† However, even without this, the existence of the theoretically
quite straightforward general decoding rule indicated above can be put to use
for studying the general properties of parity-check codes. Such a study was
inaugurated by Slepian [214]. Later Elias [182] showed that in the case of a
binary symmetric channel (and also in the case of a binary erasure channel
corresponding to the scheme of Fig. 21, where p = 0), parity-check codes are in
no way inferior to the best of all existing codes in the sense that by means of
parity-check codes information transmission can always be effected at a given
rate $C_1 = Lc_1$ bits/unit time (less than the capacity C = Lc of the channel)
such that the probability of decoding error is smaller than any preassigned
number ε > 0. Moreover, the magnitude of the error probability attainable for a
fixed transmission rate $C_1 = Lc_1$ bits/unit time, where $c_1 < c$, and for
code words of fixed length N, does not exceed $a_1^{-N}$, where $a_1$ is a
number depending on $c_1$ but always greater than unity. Thus, with increasing
N the error probability decreases by the same rule as that applicable also in
the case of a best arbitrary code. In addition, Elias has also demonstrated
that if a parity-check code is chosen ‘at random’ (i.e., every element
$b_{i,j}$ of the check matrix B is chosen by flipping a coin and assuming that
$b_{i,j} = 0$ or $b_{i,j} = 1$ according as the coin comes up heads or tails),
even then for a given channel the probability of a decoding error tends to zero
as N → ∞ (and $K = (1 - c_1)N$, so that $2^{N-K} = 2^{c_1 N}$). Moreover, the
error probability here tends to zero not more slowly than the Nth power of some
number smaller than unity.

†One such particular case was studied by Gallager [199]. It relates to a matrix
B with large values of N and K = N − M which consists, roughly speaking, almost
only of 0’s (i.e., contains just a few 1’s). Some other particular cases of an
algebraic nature are described below.

†Later Dobrushin [194] (while investigating arbitrary group codes) and Drygas
[196] (while considering more particular linear codes) extended the results of
Elias related to a binary symmetric channel to more general channels with
$m = p^k$ elementary signals and r = m (i.e., where the same signals are
transmitted and received), provided that the corresponding probabilities
$p_{A_i}(A_j)$ satisfy some specific symmetry conditions. However, for
arbitrary channels all these results remain untrue (see [189], [200]).

The fact that for many communication channels encountered in actual practice a
randomly chosen parity-check code for large N turns out to be sufficiently good
(‘almost with certainty’) makes the use of such ‘random parity-check codes’
very attractive. In order to define such a code, it is necessary to choose
randomly (and memorize) $MK = N^2 c_1(1 - c_1)$ elements $b_{i,j}$ (where
i = M, M + 1, ..., N − 1, and j = 0, 1, ..., M − 1) of the corresponding check
matrix B. Since the number $N^2 c_1(1 - c_1)$ does not increase very rapidly
with increasing N (much more slowly than, for instance, $2^{c_1 N}$), such a
problem can be fully tackled by modern computers even for an N of the order of
several hundreds. However, the procedure of decoding (i.e., determining from
the received sequence a’ the corresponding ‘error block’ e), as noted
previously, is extremely difficult in the case of an arbitrarily chosen
parity-check code, and this substantially hinders the use of ‘random codes’.
Nevertheless there exist definite promising approaches to the construction of
practically ‘good’ coding and decoding methods, with built-in provision for the
‘random’ choice of some variables defining the code under consideration (by way
of example, we may mention the so-called ‘sequential decoding’, which has been
described, for example, in [11] and the review paper [195]). Since all these
approaches are quite complicated, we shall not dwell upon them here and
immediately pass on to the application of ‘nonrandom’ parity-check codes for
detecting and correcting transmission errors.
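
The remark on how little has to be memorized for a random code can be illustrated numerically. This is a hedged sketch only: `random_coefficients` and `storage_comparison` are hypothetical names, the standard-library generator `random.Random` stands in for the coin flips, and the block length is assumed to be such that M = c_1 N is (after rounding) an integer.

```python
import random


def random_coefficients(n_block, m_info, seed=None):
    """K x M table of coin-flip coefficients b[i][j] of a systematic code."""
    rng = random.Random(seed)
    k = n_block - m_info
    return [[rng.randint(0, 1) for _ in range(m_info)] for _ in range(k)]


def storage_comparison(n_block, c1):
    """Return (M*K, 2**M): coefficients to memorize versus number of code words."""
    m = round(c1 * n_block)
    k = n_block - m
    return m * k, 2 ** m


if __name__ == "__main__":
    for n_block in (20, 100, 300):
        stored, words = storage_comparison(n_block, 0.5)
        print(f"N={n_block:4d}  MK={stored:6d}  code words = 2^{words.bit_length() - 1}")
```

For N = 300 and c_1 = 1/2 only 22 500 coefficients have to be stored, whereas the code contains 2^150 code words; the difficulty lies in decoding, not in describing the code.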

Denote the individual columns of the check matrix B (the ‘blocks’ of K = N − M
digits 0 and 1) by
$\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_{M-1}, \mathbf{b}_M, \ldots, \mathbf{b}_{N-1}$
(in the case of a systematic code the last K columns
$\mathbf{b}_M, \ldots, \mathbf{b}_{N-1}$ obviously each contain a single 1 and
N − M − 1 zeros). Then the matrix B can be written in the form of a single row


$$B = (\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_{M-1}, \mathbf{b}_M, \ldots, \mathbf{b}_{N-1}).$$


As above, denote by $e = (e_0, e_1, \ldots, e_{N-1})$ the ‘error block’
containing 1’s at the places of those elementary signals of a transmitted code
word that are distorted during transmission. The basic equation (4) can then be
written in the form


$$e_0\mathbf{b}_0 + e_1\mathbf{b}_1 + \cdots + e_{M-1}\mathbf{b}_{M-1} + e_M\mathbf{b}_M + \cdots + e_{N-1}\mathbf{b}_{N-1} = s, \qquad (5)$$


where the addition is understood to be termwise addition (in the sense of
2-arithmetic) of the corresponding ‘blocks’ of length K. Thus, the ‘block’ s,
which is obtained by replacing in the left-hand sides of equations (1’) the
transmitted signals $a_0, a_1, \ldots, a_{N-1}$ by the received signals
$a'_0, a'_1, \ldots, a'_{N-1}$ (and is then used for determining the existing
errors), is equal to the sum of the columns of B corresponding to the signals
distorted during transmission (i.e., corresponding to the value $e_i = 1$; the
remaining signals correspond to the value $e_i = 0$, and hence the
corresponding summands $e_i\mathbf{b}_i$ reduce to 0). This implies, in
particular, that a single error (i.e., a block e containing a single 1 and
N − 1 zeros) corresponds to the block s coinciding with the particular column
$\mathbf{b}_i$ of B; the occurrence of no error, however, corresponds to a
block s = 0 of N − M zeros. Hence, in order that a parity-check code allow us
to distinguish between the case of no error and all cases of a single error, it
is necessary that all columns of the corresponding check matrix B be distinct
and that none of them be equal to 0.
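
Equation (5) and the resulting condition on the columns can be sketched as follows. The helper names `columns`, `syndrome_from_columns` and `separates_single_errors`, as well as the example matrix `B`, are assumptions made for this illustration and are not taken from the text.

```python
def columns(matrix):
    """The columns b_0, ..., b_{N-1} of a check matrix, as tuples of length K."""
    return [tuple(row[j] for row in matrix) for j in range(len(matrix[0]))]


def syndrome_from_columns(matrix, error):
    """Block s of equation (5): 2-arithmetic sum of the columns where e_j = 1."""
    s = [0] * len(matrix)
    for e_j, col in zip(error, columns(matrix)):
        if e_j:
            s = [(x + y) % 2 for x, y in zip(s, col)]
    return tuple(s)


def separates_single_errors(matrix):
    """True if all columns are nonzero and pairwise distinct."""
    cols = columns(matrix)
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)


# Hypothetical check matrix of a (7, 4)-code, used only as an example.
B = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
```

For this `B`, a single error in the third signal gives s = (0, 1, 1), the third column, while simultaneous errors in the first and second signals give the same s = (1, 1, 0) + (1, 0, 1) = (0, 1, 1); distinct nonzero columns thus separate single errors from one another and from the error-free case, but not from double errors.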

The total number of possible distinct K-term blocks
$\mathbf{b} = (b_M, b_{M+1}, \ldots, b_{N-1})$ (i.e., distinct K-term sequences
of 0’s and 1’s) is equal to the number of integers written in the binary number
system by means of not more than K digits, i.e., to $2^K$ (similarly, the
number of distinct numbers of not more than K digits in the decimal number
system is $10^K$). Since the zero block (0, 0, ..., 0) is excluded here from
the number of possible columns of matrix B, the number of possible columns
turns out to be $2^K - 1$. Thus, we again arrive at the conclusion that a
parity-check code, correcting all single errors and containing K ‘check
signals’, must be formed of code words whose length does not exceed $2^K - 1$.
For defining such a code it is required only to indicate the corresponding
check matrix B, all of whose columns must be nonzero and distinct.
The codes obtained here naturally coincide with the Hamming codes referred
to on p. 314. In the case in which $N = 2^K - 1$ it is appropriate to set up
the corresponding check matrix B by choosing as its columns the binary
notations (i.e., notations in the binary number system) for all integers from 1
to $2^K - 1$, taken in ascending order. It is apparent that the code obtained
here is indeed systematic (since the matrix contains all possible columns of
K − 1 zeros and a single 1), except that the ‘check signals’ here are not the
last K signals but some other K signals. Thus, for instance, in the case in
which K = 4, $N = 2^4 - 1 = 15$, the
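corresponding (4 × 15)-matrix B is set up appropriately in the form

$$B = \begin{pmatrix}
0&0&0&0&0&0&0&1&1&1&1&1&1&1&1\\
0&0&0&1&1&1&1&0&0&0&0&1&1&1&1\\
0&1&1&0&0&1&1&0&0&1&1&0&0&1&1\\
1&0&1&0&1&0&1&0&1&0&1&0&1&0&1
\end{pmatrix}$$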


(Note that if we desire here to write all code words in a way similar to that on
p. 311 for application to the case in which K = 3, N = 7, then we shall have to
write $2^{11} = 2048$ fifteen-digit numbers!) For such a matrix B the role of
‘check signals’ is assumed by the first, second, fourth and eighth signals (since
the columns containing three 0’s and a single 1 correspond specifically to them);
the other 11 signals are information signals. The block s is here zero when there
is no transmission error, and in the case of a single error is equal to the
corresponding column of B, i.e., it directly gives the binary notation of the
number of the elementary signal that is distorted during transmission. Hence, it
is seen that in this case it is extremely straightforward to accomplish the
decoding (i.e., deciphering the received signals and correcting the errors in them).
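
The decoding rule just described is easily tried out numerically. The following
Python fragment is only a minimal sketch under our own assumptions: the names
`syndrome`, `encode` and `correct` are hypothetical, and each column of B is
stored simply as the integer whose binary notation it is, so that the sum of
columns in 2-arithmetic becomes a bitwise XOR.

```python
# Sketch (not the book's own procedure): the K = 4, N = 15 code whose j-th
# check-matrix column, j = 1..15, is the binary notation of the integer j.
K = 4
N = 2**K - 1

# Assumption: a column is represented by the integer j itself.
columns = list(range(1, N + 1))

def syndrome(word):
    """Block s = B a' as an integer: XOR of the columns where word has a 1."""
    s = 0
    for col, bit in zip(columns, word):
        if bit:
            s ^= col
    return s

def encode(info):
    """Put 11 information signals at the positions that are not powers of two,
    then fill the check signals at positions 1, 2, 4, 8 so that s = 0."""
    word = [0] * N
    it = iter(info)
    for pos in range(1, N + 1):
        if pos & (pos - 1):              # nonzero unless pos is a power of two
            word[pos - 1] = next(it)
    s = syndrome(word)
    for k in range(K):                   # check signal 2**k cancels bit k of s
        word[2**k - 1] = (s >> k) & 1
    return word

def correct(word):
    """Corrected copy of word, assuming at most one signal was distorted."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1                # s is the number of the distorted signal
    return fixed
```

With this representation the block s need not be looked up in any table: a
nonzero value of `syndrome` is itself the number of the signal to be inverted.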

A single-error-correcting code related to blocks of $N < 2^K - 1$ signals is easy
to obtain by deleting in the corresponding check matrix B a certain number of
‘superfluous’ columns (which can be chosen arbitrarily out of those containing
not less than two 1’s). It may also be noted that the properties of the Hamming
code can be sharpened further by adding to each code word an auxiliary
(K + 1)th ‘check signal’ $a_N$, which even allows us to detect (but not correct)
all double errors. To do this, the only requirement is to choose the binary
signal $a_N$ such that it yields an even number when added to all the rest of the
signals, i.e., it satisfies the relation

$$a_0 + a_1 + \ldots + a_{N-1} + a_N = 0.$$

(It is easy to comprehend that this corresponds to adding to the check matrix B
first an additional last column of only 0’s and then an additional last row of
N + 1 1’s; as a result, the number of both columns and rows of B increases by
one.) In such a case, the absence of any error again corresponds to a block s
of only 0’s; in the case of a single error the first K digits of s represent the
binary notation of some integer in the range from 0 to $2^K - 1$, and the last
digit $s_{K+1}$ is unity (since the sum of all received signals is necessarily odd
here); finally, the presence of even a single 1 among the first K elements of s
and its last element reducing to 0 indicate the presence of a double error. The
Hamming code thus refined is proposed also in [203]; it is sometimes called an
extended Hamming code.
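
The three cases distinguished here can be illustrated by a short Python sketch.
It is written under our own assumptions: the function names are hypothetical,
the first 15 signals are checked by the K = 4 matrix whose columns are the
binary notations of 1, ..., 15, and the 16th signal is the auxiliary signal
$a_N$.

```python
# Sketch of the extended code: N = 15 signals plus one overall parity signal.
K = 4
N = 2**K - 1

def syndrome(word):
    """Return (first K digits of s as an integer, last digit s_{K+1})."""
    s = 0
    for pos, bit in enumerate(word[:N], start=1):
        if bit:
            s ^= pos                 # column number pos, in binary notation
    parity = sum(word) % 2           # sum of all N + 1 received signals
    return s, parity

def classify(word):
    """Decide which case the received block of N + 1 signals falls into."""
    s, parity = syndrome(word)
    if s == 0 and parity == 0:
        return "no error"
    if parity == 1:
        return "single error"        # includes distortion of the signal a_N itself
    return "double error"            # a 1 among the first K digits, last digit 0
```

Since the code is a group code, it suffices to try the rule on the zero code
word with one or two signals inverted.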

We now pass on to codes correcting not only all single but also all double
errors in a block of N signals. It is clear that when there are no transmission
errors a block $s = Ba'$ of K elements is composed of only 0’s; in the case of
one error it is equal to the corresponding column of the check matrix B, and in
the case of two errors, to the sum of the two corresponding columns of B (cf.
relation (5) on p. 324). In order that all of these cases be distinguished at the
channel output, all columns of B must be nonzero, different from each other, and
such that the sum of any two of them differs from all other columns and from all
other pairwise sums of the columns. Following Sacks [213] we can undertake to
construct a matrix satisfying all these conditions by means of a simple sorting
out. With this object, we can choose the first column $b_0$ of B in an arbitrary
manner (but such that it does not consist of only 0’s). Then we take as $b_1$ an
arbitrary nonzero block of K digits 0 and 1 distinct from $b_0$, as $b_2$ a
nonzero block distinct from $b_0$, $b_1$ and $b_0 + b_1$, and as $b_3$ a nonzero
block distinct from $b_0$, $b_1$ and $b_2$, as well as from the pair sums
$b_0 + b_1$, $b_0 + b_2$, $b_1 + b_2$, and the triple sum $b_0 + b_1 + b_2$
(because in 2-arithmetic if $b_0 + b_1 + b_2 = b_3$, then

$$b_0 + b_1 = b_2 + b_3,$$

i.e., the errors in the first two code word signals cannot be distinguished here
from the errors in the third and fourth signals), and so on. After the first i
columns $b_0, b_1, \ldots, b_{i-1}$ are so chosen, the prescription for the choice
of the (i + 1)th column $b_i$ is that this column

(a) not be a zero column;

(b) not be equal to any of the $i = \binom{i}{1}$ columns
$b_0, b_1, \ldots, b_{i-1}$ already chosen;

(c) be equal to none of the $\binom{i}{2}$ pairwise sums of columns already chosen;

(d) be distinct from all $\binom{i}{3}$ sums of three already chosen columns.

Obviously, the enumerated $1 + \binom{i}{1} + \binom{i}{2} + \binom{i}{3}$
conditions (a)-(d), restricting the choice of column $b_i$, are not necessarily
all distinct among themselves. (Thus, for instance, for i > 5 it is possible that

$$b_0 + b_1 + b_2 = b_3 + b_4 + b_5,$$

or that

$$b_1 + b_2 + b_3 = b_4 + b_5.)$$


However, since the number of all distinct columns (i.e., blocks of K digits 0 and
1) is equal to $2^K$, if only

$$2^K > 1 + \binom{i}{1} + \binom{i}{2} + \binom{i}{3},$$

then conditions (a)-(d) can surely be satisfied even in the least favourable case
in which all columns and their combinations figuring in these conditions are
distinct. Of the relations obtained here, the most restrictive is the one applied
to the last column $b_{N-1}$ (since with increasing i the number of excluded
combinations, with which a new column must not coincide, also increases). Hence,
if only

$$K > \log\left\{1 + \binom{N-1}{1} + \binom{N-1}{2} + \binom{N-1}{3}\right\}, \qquad (**)$$

then a (K × N) check matrix B can certainly be chosen that yields a parity-check
code correcting all single errors and all double errors in a block of N elementary
signals.
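
The sorting-out procedure with conditions (a) to (d) is mechanical enough to be
written down as a short program. The sketch below is ours, not Sacks’: the
function names are hypothetical, columns are stored as K-bit integers (so that
addition in 2-arithmetic is XOR), and at each step the smallest admissible
column is taken, although any admissible column would do.

```python
from itertools import combinations

def greedy_check_matrix(K, N):
    """Choose N columns of K digits one by one; return None if the choice
    becomes impossible before N columns have been found."""
    chosen = []
    for _ in range(N):
        forbidden = {0}                              # condition (a)
        for r in (1, 2, 3):                          # conditions (b), (c), (d)
            for combo in combinations(chosen, r):
                total = 0
                for column in combo:
                    total ^= column
                forbidden.add(total)
        candidate = next(
            (c for c in range(1, 2**K) if c not in forbidden), None
        )
        if candidate is None:
            return None
        chosen.append(candidate)
    return chosen

def corrects_double_errors(columns):
    """True if all columns and all pairwise sums of columns are nonzero and
    mutually distinct, so that single and double errors are distinguishable."""
    blocks = list(columns) + [a ^ b for a, b in combinations(columns, 2)]
    return 0 not in blocks and len(set(blocks)) == len(blocks)
```

For K = 4 and N = 5 the procedure runs to completion, as the inequality
guarantees; for K = 3 and N = 5 it stops after three columns, since by then
every block of three digits is excluded.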

The inequality obtained here is the Varshamov-Gilbert inequality, which was
given on p. 316 without proof (for the case of an arbitrary number n of errors
corrected by our code). It is clear that in the general case of an arbitrary n
this inequality is proved in exactly the same way as in the case in which n = 2.
The only requirement now is that each time the new column $b_i$ must not be a
null column, or equal to any of the previous columns, or equal to any of the sums
of two, three, ..., $2n - 1$ preceding columns. This implies the general
Varshamov-Gilbert inequality

$$K > \log\left\{1 + \binom{N-1}{1} + \binom{N-1}{2} + \ldots + \binom{N-1}{2n-1}\right\}. \qquad (***)$$

Let us again consider the case n = 2. It is obvious that for small values of K
and N it is possible to hope that all conditions imposed on the columns of
matrix B can be directly verified, thus giving a construction of a
single-error-correcting and double-error-correcting code. This is, in fact, the
method used on p. 314, where for the case in which K = 4 and N = 5 the selection
procedure was used to construct a parity-check code which permits us to correct
all single and all double errors. The check matrix corresponding to this code
obviously has the form


$$
B = \begin{pmatrix}
1&1&0&0&0\\
1&0&1&0&0\\
1&0&0&1&0\\
1&0&0&0&1
\end{pmatrix}
$$


(Note that for N = 5 and n = 2 the Hamming inequality indicates that
$K \geq 4$ necessarily; from the Varshamov-Gilbert inequality, however, it follows
here that for $K \geq 4$ we can in fact construct a code correcting all single and
double errors.) Although slightly more intricate, it is completely possible to
verify the fact that for K = 7 and N = 10 all columns and pairwise sums of
columns of the (7 × 10)-matrix


1000000101 
0100000001 
0010000101 
0001000011 
0000100110
0000010010 
0000001110 


are distinct from each other. Hence, the corresponding code (all of whose code
words contain 3 information signals and 7 check signals) allows us to correct
all single and double errors in blocks of 10 signals. (For N = 10, from the
Hamming inequality it follows that $K \geq 6$ necessarily, and from the
Varshamov-Gilbert inequality it follows that for $K \geq 8$ the code we are
interested in can certainly be constructed.)
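
For the (7 × 10)-matrix above the verification can be left to a machine. The
following lines are a minimal sketch with names of our own choosing; the rows
are copied from the matrix, each column is read from top to bottom as a
seven-digit binary number, and the 10 columns together with their 45 pairwise
sums are tested for being nonzero and mutually distinct.

```python
from itertools import combinations

# Rows of the (7 x 10)-matrix, copied from the text.
ROWS = [
    "1000000101",
    "0100000001",
    "0010000101",
    "0001000011",
    "0000100110",
    "0000010010",
    "0000001110",
]

# Column j read from top to bottom as a 7-bit integer (our own encoding).
columns = [int("".join(row[j] for row in ROWS), 2) for j in range(10)]

# Blocks s produced by every single error and by every double error.
single = list(columns)
double = [a ^ b for a, b in combinations(columns, 2)]
blocks = single + double

all_nonzero = 0 not in blocks
all_distinct = len(set(blocks)) == len(blocks)
```

Both flags come out true, so each of the 55 possible single and double errors
produces its own nonzero block s.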

However, further increase in the values of K and N rapidly increases the un- 
wieldiness of the described procedure for choosing matrix B and verifying the 
validity of the requisite conditions for the columns of this matrix. For example, 
in the case of an (8 × 15)-matrix B given later on p. 335, the problem of
carrying out all necessary checks is hardly any different. 


We shall now briefly sketch some fundamental principles of algebraic coding
theory. This theory has played a central role in the development of general
methods for constructing practically usable codes, which allow the detection and
correction in a block of N signals of any number of errors not exceeding a given
number n. So far we have considered a code as a collection of some code
words, i.e., blocks $a = (a_0, a_1, \ldots, a_{N-1})$ of N digits 0 and 1 (i.e.,
of N elements of the simplest binary algebraic field). It is clear that we can
also associate with every code word a code polynomial of degree not higher than
$N - 1$:

$$a(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_{N-1} x^{N-1}$$

with coefficients from our field. We may then consider a code as a collection of
‘code polynomials’ a(x). All possible parity-check codes (i.e., all group codes)
in such a case correspond to all possible collections of polynomials a(x) such
that the sum of any two polynomials belonging to our collection, and also the
‘null polynomial’ $0 = 0 + 0 \cdot x + \ldots + 0 \cdot x^{N-1}$, necessarily
belong to the same collection. There is an extensive class of quite simple
collections of polynomials which obviously satisfy the two indicated conditions.
This class consists of collections of all polynomials a(x) of degree not greater
than $N - 1$, which are divisible by a fixed polynomial
$g(x) = g_0 + g_1 x + \ldots + g_K x^K$ of degree $K < N - 1$, i.e., can be
represented in the form


a(x) = c(x)g(x), (6) 


where g(x) is a fixed polynomial and c(x) is an arbitrary polynomial of degree
not exceeding $N - K - 1$. Each such collection determines a definite
parity-check code, which we call a code generated by the polynomial g(x); g(x)
itself in this case is called the generator polynomial, or simply generator of
our code. It is clear that the coefficients $g_0$ and $g_K$ of the generator
polynomial must be different from 0 for all polynomial generated codes. In fact,
if $g_0 = 0$, then the first coefficient $a_0$ of all the code words (6) is also
equal to 0, i.e., the first signal of the block
$a = (a_0, a_1, \ldots, a_{N-1})$ contains no information. If $g_K = 0$, then we
must just consider the generator polynomial g(x) as a polynomial of degree
$K - 1$.

In the case of polynomial generated codes, the advantage of the generator
polynomial is that it gives a highly compact method of determining the
corresponding code, which uniquely defines all of its characteristics (in
particular, the collection of all code words a and the corresponding check
matrix B). If an arbitrary code polynomial a(x) of such a code is expressed in
the form

$$a(x) = a_0 + a_1 x + \ldots + a_{K-1} x^{K-1} + a_K x^K + a_{K+1} x^{K+1} + \ldots + a_{N-1} x^{N-1},$$

then it is apparent here that the last $M = N - K$ coefficients
$a_K, a_{K+1}, \ldots, a_{N-1}$ can be chosen arbitrarily, and the first K
coefficients $a_0, a_1, \ldots, a_{K-1}$ are then uniquely determined by the
condition of the divisibility of a(x) by g(x). (Specifically, since in
2-arithmetic $r(x) = -r(x)$, the polynomial
$a_0 + a_1 x + \ldots + a_{K-1} x^{K-1}$ must coincide with the remainder after
dividing $a_K x^K + a_{K+1} x^{K+1} + \ldots + a_{N-1} x^{N-1}$ by g(x).) From
this it is seen that the last $N - K$ signals $a_K, a_{K+1}, \ldots, a_{N-1}$ in
the given case correspond to information signals and the first K signals
$a_0, a_1, \ldots, a_{K-1}$ to check signals. Hence the considered code is a
block (N, N - K)-code and the total number of code words here is $2^{N-K}$. A
block $a' = (a'_0, a'_1, \ldots, a'_{N-1})$ received at the channel output
corresponds to the polynomial

$$a'(x) = a'_0 + a'_1 x + \ldots + a'_{N-1} x^{N-1},$$

which differs from the ‘transmitted polynomial’ a(x) by the ‘error polynomial’

$$e(x) = e_0 + e_1 x + \ldots + e_{N-1} x^{N-1},$$

where, as previously, $e_i = a'_i - a_i$ (i.e., $e_i = 1$ if the ith signal is
distorted during transmission, and $e_i = 0$ if it is received correctly). Due to
the presence of an additional ‘error polynomial’ e(x), the polynomial a'(x) is in
general not evenly divisible by g(x). The nonzero remainder r(x) resulting from
the division of a'(x) by g(x) (obviously equal to the remainder after dividing
e(x) by g(x)) is also an indicator of the occurrence of distortions during
transmission; this remainder contains all information about errors transmitted
to the receiving end. (The remainder r(x) is in this respect completely analogous
to the block $s = Ba'$ that we dealt with in the matrix description of an
arbitrary parity-check code.)

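
The way in which the check signals are obtained as a remainder, and in which the
remainder r(x) depends only on e(x), can be seen in a small computation. The
Python lines below are a sketch under assumptions of our own: a polynomial is
stored as an integer whose ith binary digit is the coefficient of $x^i$, the
names `poly_mod` and `encode` are hypothetical, and for illustration we take
$g(x) = x^3 + x + 1$ with N = 7.

```python
def poly_mod(a, g):
    """Remainder after dividing a(x) by g(x) in 2-arithmetic."""
    dg = g.bit_length() - 1                      # degree K of g(x)
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)      # cancel the leading term of a
    return a

def encode(info_bits, g):
    """info_bits are a_K, ..., a_{N-1}; the K check coefficients a_0..a_{K-1}
    are the remainder of the information part (recall r(x) = -r(x))."""
    K = g.bit_length() - 1
    high = 0
    for i, bit in enumerate(info_bits):
        high |= bit << (K + i)                   # a_K x^K + ... + a_{N-1} x^{N-1}
    return high ^ poly_mod(high, g)

G = 0b1011          # illustrative choice: g(x) = x^3 + x + 1, so K = 3
N = 7

a = encode([1, 0, 1, 1], G)                      # a code polynomial
e = 1 << 5                                       # error polynomial e(x) = x^5
received = a ^ e                                 # a'(x) = a(x) + e(x)
r = poly_mod(received, G)                        # remainder r(x)
```

Here `poly_mod(a, G)` is zero, while `r` coincides with `poly_mod(e, G)`
whatever information signals were transmitted.
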
The foregoing discussion shows that the collection of all detectable error
blocks $e = (e_0, e_1, \ldots, e_{N-1})$ can be described very easily in the case
of a polynomial generated code. In fact, it follows from the above-stated results
that block e is detectable if and only if the corresponding error polynomial e(x)
yields a nonzero remainder when divided by g(x). The correction of errors is also
often possible when a polynomial generated code is used. To explain this, let us
first of all remark that two code polynomials a(x) and $a_1(x)$ of a polynomial
generated code cannot differ in less than two coefficients. This is clear since
the difference of two code words must be divisible by g(x) without remainder and
if a(x) and $a_1(x)$ differ in only one coefficient, then their difference is
proportional to $x^i$ and, therefore, it cannot be divisible by any polynomial
$g(x) \neq 1$ with $g_0 \neq 0$. Moreover, if g(x) does not coincide with a
divisor of a polynomial of the form $x^L - 1$, where L < N, then two code
polynomials a(x) and $a_1(x)$ cannot differ in less than three coefficients. In
fact, if a(x) and $a_1(x)$ differ in only two coefficients, then

$$a(x) - a_1(x) = x^i - x^j = x^j (x^{i-j} - 1), \quad \text{where } 0 \leq j < i \leq N - 1$$

(recall that $x^i + x^j = x^i - x^j$ in 2-arithmetic). The last relation shows
that g(x) must be a divisor of $x^{i-j} - 1$. Therefore, if g(x) is not a divisor
of any polynomial $x^L - 1$, L = 1, 2, ..., N - 1, then the smallest number d of
distinct elementary signals is not less than 3 for any pair of code words, i.e.,
the code permits us to correct any single error (see p. 309). Similarly, any two
code polynomials a(x) and $a_1(x)$ will differ in more than three coefficients,
if and only if the generator polynomial g(x) is not a divisor of any polynomial
of the form either

$$x^i + x^j = x^i - x^j = x^j (x^{i-j} - 1), \quad \text{or} \quad x^i + x^j + x^k,$$

and so on.

In algebraic coding theory, attention is mainly focused not on general
parity-check codes, nor even on more special arbitrary polynomial generated
codes, but on some particular classes of such polynomial generated codes, having
a singularly simple algebraic structure that appreciably facilitates obtaining a
practically convenient coding and decoding procedure. Of these particular
classes, the most important one is that of cyclic codes. A parity-check code is
called cyclic, if for each of its code words
$a = (a_0, a_1, a_2, \ldots, a_{N-1})$, the block
$(a_{N-1}, a_0, a_1, \ldots, a_{N-2})$, which is obtained by shifting a
cyclically, is also a code word. It is clear that in such a case a block
$(a_{N-i}, a_{N-i+1}, \ldots, a_{N-i-1})$ obtained by performing an i-fold
‘cyclic shifting’ of a is also a code word for every i = 1, 2, 3, ..., N - 1.

An important property of cyclic codes is that they are all generated by
polynomials and it is quite simple to characterize the class of generating
polynomials g(x) corresponding to them. In fact, let us first assume that we are
concerned with the code generated by the polynomial g(x) (i.e., with a collection
of code polynomials a(x) of the form (6)). Suppose that

$$a_1(x) = a_{N-1} + a_0 x + a_1 x^2 + \ldots + a_{N-2} x^{N-1}$$

is a polynomial corresponding to a shifted block
$(a_{N-1}, a_0, a_1, \ldots, a_{N-2})$. Since

$$a_1(x) = x(a_0 + a_1 x + \ldots + a_{N-1} x^{N-1}) - a_{N-1}(x^N - 1) = x a(x) - a_{N-1}(x^N - 1), \qquad (7)$$

where, as usual, $a(x) = a_0 + a_1 x + \ldots + a_{N-1} x^{N-1}$, it is clear that
for the general case in which $a_{N-1} \neq 0$, $a_1(x)$ is a code polynomial
simultaneously with a(x) (i.e., is evenly divisible by g(x)) if and only if g(x)
is a factor of $x^N - 1$.† Thus, a code generated by a polynomial g(x) is cyclic
in that (and only that) case in which g(x) is a factor of the polynomial
$x^N - 1$.
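
This criterion can be checked directly on small examples by listing all code
polynomials and shifting each of them. The sketch below is ours: polynomials are
again stored as integers (binary digit i is the coefficient of $x^i$), all
function names are hypothetical, and the three generators tried in the usage
note are chosen by us for illustration only.

```python
def poly_mul(a, b):
    """Product of two polynomials in 2-arithmetic."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def poly_mod(a, g):
    """Remainder after dividing a(x) by g(x) in 2-arithmetic."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def code_polynomials(g, N):
    """All a(x) = c(x) g(x) with the degree of c(x) not exceeding N - K - 1."""
    K = g.bit_length() - 1
    return {poly_mul(c, g) for c in range(2 ** (N - K))}

def cyclic_shift(a, N):
    """(a_0, ..., a_{N-1}) -> (a_{N-1}, a_0, ..., a_{N-2}), cf. relation (7)."""
    a <<= 1
    if a >> N:                       # a_{N-1} was 1: subtract x^N - 1
        a = (a ^ (1 << N)) | 1
    return a

def is_cyclic(g, N):
    code = code_polynomials(g, N)
    return all(cyclic_shift(a, N) in code for a in code)

def divides_xN_minus_1(g, N):
    return poly_mod((1 << N) | 1, g) == 0    # x^N - 1 = x^N + 1 in 2-arithmetic
```

For N = 7 the polynomial $x^3 + x + 1$ (`0b1011`) passes both tests, whereas
$x^3 + x^2 + x + 1$ (`0b1111`) fails both; $x^2 + x + 1$ (`0b111`) generates a
cyclic code for N = 6 but not for N = 7.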

†Such polynomials g(x) are called cyclotomic in algebra; the case for which the
coefficients of g(x) are ordinary real numbers was studied extensively by the
great German mathematician Carl Friedrich Gauss at the turn of the nineteenth
century.

Consider now an absolutely arbitrary cyclic code, and suppose that a(x) is a
code polynomial corresponding to it. Then, from equation (7) it follows directly
that, together with a(x), in the collection of code polynomials of our code there
necessarily occurs also the remainder after dividing the polynomial xa(x) by
$x^N - 1$. But then it is clear that in the collection of code polynomials there
are also the remainders after dividing the polynomials
$x \cdot x a(x) = x^2 a(x)$, $x \cdot x^2 a(x) = x^3 a(x)$, ... by $x^N - 1$,
i.e., the remainders after dividing all possible products $x^i a(x)$ by
$x^N - 1$, where i is any nonnegative integer. Since, furthermore, the sum of any
code polynomials is always a code polynomial also, it follows from our assertions
that, together with a(x), all remainders after dividing polynomials of the form
b(x)a(x) by $x^N - 1$ are also code polynomials, where
$b(x) = b_0 + b_1 x + \ldots + b_m x^m$ is an arbitrary polynomial with
coefficients from our two-element field (i.e., either 0 or 1).

A collection of all possible polynomials of degree not greater than $N - 1$ can
be considered as a collection of all possible remainders resulting from the
division of polynomials of any degree by $x^N - 1$. Then, the property enunciated
above of a collection of code polynomials a(x) of an arbitrary cyclic code can
be stated as follows in the language of general algebra: such a collection of code
polynomials is an ideal in the set of all remainders after dividing the
polynomials by $x^N - 1$ (see Appendix II, where a general definition of an ideal
is given, and a particular case of this notion required by us is also
considered). In the following, we shall not use the general definition of an
ideal; we need only the following straightforward algebraic theorem (which the
reader, if desired, may accept on faith, but may also acquaint himself with its
proof from Appendix II): any ideal in a set of remainders from the division of
arbitrary polynomials by some fixed polynomial f(x) of degree N coincides with a
collection of polynomials of the form c(x)g(x), where g(x) is some fixed factor
of the polynomial f(x) and the degree of c(x)g(x) is not greater than $N - 1$.
This algebraic theorem evidently implies that every cyclic code is generated by
some factor g(x) of the polynomial $x^N - 1$.

Suppose that g(x) is a factor of $x^N - 1$ and hence

$$x^N - 1 = g(x)h(x).$$

In such a case it is easy to show that the code polynomials of a cyclic code with
generator polynomial g(x) are such polynomials a(x) of degree not exceeding
$N - 1$, for which a(x)h(x) is evenly divisible by $x^N - 1$. In fact, if
a(x) = c(x)g(x), then it is obvious that

$$a(x)h(x) = c(x)g(x)h(x) = c(x)(x^N - 1)$$

is evenly divisible by $x^N - 1$; conversely, if $a(x)h(x) = b(x)(x^N - 1)$ is
evenly divisible by $x^N - 1$, then it is clear that a(x) = b(x)g(x). This
characteristic of a(x) greatly facilitates checking for the occurrence of
transmission errors: if

$$a'(x) = a(x) + e(x),$$

where $e(x) \neq 0$, then a'(x)h(x) is in general not divisible by $x^N - 1$. It
is also easy to see that all information concerning the occurrence of errors
(i.e., about the polynomial e(x)) available at the channel output is contained in
the remainder from the division of a'(x)h(x) by $x^N - 1$. (Note that the
division of an arbitrary polynomial d(x) by $x^N - 1$ is very easy to perform:
for this the only requirement is to replace in d(x) all powers $x^M$, where
$M \geq N$, by the power $x^m$, where m is the remainder after dividing M by N.)
Hence, in decoding a cyclic code, a vital role is played by the polynomial h(x),
which we agree to call the check polynomial of a cyclic code. In fact, the
polynomial a'(x) received at the channel output should first be multiplied by
the check polynomial h(x), and then the remainder resulting from the division of
this product by $x^N - 1$ uniquely determines the deciphering of the received
message (i.e., the choice of the ‘most probable error polynomial’ e(x)).

Cyclic codes form a special sub-class of parity-check codes, whose general
characteristics have so far not been studied to a great extent. Thus, for
instance, if we confine ourselves to the use of cyclic codes alone, then it is
not known whether or not we can attain information transmission over the
simplest binary symmetric channel at a given rate less than C = Lc bits/unit
time and with error probability as small as desired. Moreover, it is not even
known whether or not the transmission can be effected at least at a rate
different from zero and with error probability as small as desired.† However,
the great advantage of cyclic codes lies in the fact that here we may develop
some relatively uncomplicated algebraic decoding methods, which allow us in many
cases to accomplish this decoding in a relatively short time (see, for example,
references cited on pp. 305-306 and also the rather advanced book [207],
especially devoted to this problem).

†Recall, as remarked on pp. 273-274, that until the appearance of Shannon’s work
[21] the impossibility of such transmission looked probable even for the case in
which quite arbitrary codes are used. It is now known that for arbitrary codes
the situation is entirely different, and the same is true for general
parity-check codes. However, in relation to the more special cyclic codes alone
the impossibility indicated has not yet been ruled out.

The application of cyclic codes is exceptionally fruitful for correcting all
errors whose number does not exceed a given n in an N-term block. According to
the foregoing discussion related to arbitrary polynomial generated codes, in
order that it be possible to correct all single errors by using a cyclic code
generated by the polynomial g(x), the only requirement is that none of the
binomials

$$x^j - x^i = x^i (x^{j-i} - 1), \quad \text{where } i < N,\ j < N \text{ and } j > i,$$

be divisible by g(x). The polynomials g(x) with the prescribed properties, which
are factors of $x^N - 1$ (i.e., correspond to cyclic codes), always exist and
have been well studied for all $N = 2^K - 1$. Hence all the Hamming codes with
$N = 2^K - 1$ can easily be put into the form of cyclic codes. In particular, it
is easy to verify that for the case in which K = 3, N = 7 (considered on pp.
310-311) the generator polynomial g(x) and the check polynomial h(x) can be
chosen in the form

$$g(x) = x^3 + x + 1, \qquad h(x) = x^4 + x^2 + x + 1$$

(direct multiplication shows that $g(x)h(x) = x^7 - 1$, as it should). Moreover,
for the case in which K = 4, N = 15 (considered on p. 325), it is possible to
set

$$g(x) = x^4 + x + 1, \qquad h(x) = x^{11} + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1$$

(here $g(x)h(x) = x^{15} - 1$).

Analogously, for double-error-correcting codes, which permit us to correct all
single and double errors, all monomials $x^i$, binomials $x^i + x^j$, trinomials
$x^i + x^j + x^k$ and quadrinomials $x^i + x^j + x^k + x^l$, where
i, j, k, l < N, when divided by g(x) must yield nonzero remainders, and so on. It
is clear that the problems arising here are specifically algebraic in their
character; their solution turns out to be sufficiently involved, however.

A general method for the construction of cyclic codes, capable of correcting
any number of errors not exceeding n in a block of length $N = 2^K - 1$ and
having a check matrix with nK rows and N columns (i.e., containing not more than
nK check signals in a block of $N = 2^K - 1$ signals†), was indicated only in
1959 by Hocquenghem [204] and independently by Bose and Chaudhuri [192] in
1960.†† The Bose-Chaudhuri-Hocquenghem construction is not very complicated, but
it is based on some relatively advanced algebraic concepts and results. (These
concepts and results can be found in Appendix II at the end of the book and
the construction indicated will be described at the end of this section. The
reader may well skip over this matter in case he is not interested in this
algebraic construction.) Here we restrict ourselves to two examples of the
error-correcting Bose-Chaudhuri-Hocquenghem codes. Both these examples relate to
the case in which K = 4, $N = 2^4 - 1 = 15$. The Hamming code correcting all
single errors, which corresponds to these values of K and N, is defined by the
check matrix written on p. 325. In the case of a double-error-correcting code,
the check matrix can be represented as an (8 × 15)-matrix of the form


†Since the corresponding code is not systematic, from the fact that the check
matrix contains nK rows it can only be inferred that the actual number of check
signals here is not greater than nK (see p. 319).

††Generally speaking, besides the simplest (the so-called primitive)
Bose-Chaudhuri-Hocquenghem codes correcting a given number of errors in a block
of $N = 2^K - 1$ signals, there exist also non-primitive codes of the same type,
for which the block-length N is an odd number not representable in the form
$2^K - 1$. We shall not consider these last codes in this book (see, however,
the footnote † on p. 341).

This matrix is quite cumbersome; hence it is much more convenient to specify the
corresponding code with the aid of its generator polynomial

$$g(x) = (x^4 + x + 1)(x^4 + x^3 + x^2 + x + 1) = x^8 + x^7 + x^6 + x^4 + 1$$

or its check polynomial

$$h(x) = (x + 1)(x^2 + x + 1)(x^4 + x^3 + 1) = x^7 + x^6 + x^4 + 1$$

(it is easy to verify that, indeed, $g(x)h(x) = x^{15} - 1$). Note that the code
under consideration consists of code words of length 15, involving 7 information
and 8 check signals. By virtue of the Hamming inequality (*) on p. 315, we can
say that for N = 15 a code correcting all single errors and all double errors
cannot contain less than 7 check signals; here the Varshamov-Gilbert inequality
(***) on p. 327 shows that such a code can certainly be constructed if
$K \geq 9$.

If we now wish to construct a code correcting in a block of 15 signals all
single, double and triple errors, then the check matrix of such a
Bose-Chaudhuri-Hocquenghem code has 3K = 12 rows (and, as previously, 15
columns). The generator polynomial of the code we are interested in assumes the
relatively simple form

$$g(x) = (x^4 + x + 1)(x^2 + x + 1)(x^4 + x^3 + x^2 + x + 1) = x^{10} + x^8 + x^5 + x^4 + x^2 + x + 1,$$

and its check polynomial is given by

$$h(x) = (x + 1)(x^4 + x^3 + 1) = x^5 + x^3 + x + 1$$

(here again we have $g(x)h(x) = x^{15} - 1$). The (12 × 15)-check matrix of our
code is

Note that although this matrix has 12 rows, the number of ‘check signals’
associated with the corresponding code is 10. This is immediate from the fact
that the generator polynomial g(x) is in this case a polynomial of degree 10.†
Thus, in the use of the code under consideration every set of five ‘information
signals’ is supplemented by ten ‘check signals’. It is only then that in a
sequence of 15 signals received at the channel output it is possible to detect
and correct without exception all single, double and triple errors. It is also
routine to see that the correction of all such errors in a block of 15 signals
can in no way be achieved using less than 10 ‘check signals’. This fact is
immediate from the Hamming inequality (the Varshamov-Gilbert inequality here
shows that the code we require can surely be constructed if 12 or more ‘check
signals’ are employed).
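
The polynomial identities quoted for N = 15, and the error-correcting capability
of the two codes, can be confirmed by direct computation. The sketch below is
ours: polynomials are stored as integers (binary digit i is the coefficient of
$x^i$), and the names `M1`, `M3`, `M5`, `M7`, `X1` and `min_weight` are
hypothetical labels for the factors appearing in the text. Since the codes are
group codes, the smallest number of signals in which two code words differ
equals the smallest number of 1’s in a nonzero code word.

```python
def poly_mul(a, b):
    """Product of two polynomials in 2-arithmetic."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

N = 15
M1 = 0b10011    # x^4 + x + 1
M3 = 0b11111    # x^4 + x^3 + x^2 + x + 1
M5 = 0b111      # x^2 + x + 1
M7 = 0b11001    # x^4 + x^3 + 1
X1 = 0b11       # x + 1

g2 = poly_mul(M1, M3)                    # double-error-correcting code
h2 = poly_mul(poly_mul(X1, M5), M7)
g3 = poly_mul(g2, M5)                    # triple-error-correcting code
h3 = poly_mul(X1, M7)

def min_weight(g, N):
    """Smallest number of 1's among the nonzero code polynomials c(x) g(x)."""
    K = g.bit_length() - 1
    return min(
        bin(poly_mul(c, g)).count("1") for c in range(1, 2 ** (N - K))
    )
```

Here `min_weight(g2, N)` and `min_weight(g3, N)` come out as 5 and 7, i.e., any
two code words differ in more than 2n signals for n = 2 and n = 3 respectively.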

†The same conclusion in the considered case can be drawn by starting from the
very form of the check matrix. Since its third row from the bottom consists of
0’s alone and the two rows following it are identical, it is clear that the code
is not affected if of the last three rows we retain only one (the last or
penultimate) row.

The data on the number of ‘information’ and ‘check’ signals for many distinct
Bose-Chaudhuri-Hocquenghem codes can be found in various books on coding and
information theory (see, for instance, [212, Chapter 9], [190, Chapters 7 and
12]). By the results deduced in [212] all codes of this type with N < 15 and
even those with N arbitrary but n = 2 are optimal in the sense that there does
not exist a code with the same block length N and the same total number of code
words S (i.e., with the same information rate v = (L/N) log S bits/unit time),
leading to a lower error probability when it is used for transmission over a
binary symmetric channel (see p. 342). When N = 1023 ($= 2^{10} - 1$) the number
of ‘check signals’ for different values of n turns out to be quite close to the
corresponding Varshamov-Gilbert bound. But for still larger N this must approach
more closely to the Hamming lower bound and not to the Varshamov-Gilbert upper
bound. In fact, if we use the upper bound of the binomial coefficient
$\binom{N}{n}$ given by the inequality (**) on p. 165 and a similar lower bound
of these coefficients (or even simply substitute into the exact formula
$\binom{N}{n} = N!/[n!\,(N - n)!]$ the approximate values of the factorials N!
and (N - n)! for large N, available in many advanced mathematics texts), it is
routine to show that for very large N the general Hamming inequality assumes the
form

$$2^K \geq A N^n, \quad \text{i.e.,} \quad K \geq n \log N + A_1,$$

where K is the number of check signals, n the maximum number of errors to be
corrected, and A and $A_1 = \log A$ are some numbers (A is positive, but $A_1$
may possibly be negative) depending on n but not on N. Similarly, the
Varshamov-Gilbert inequality in the case of large N allows us to conclude that if

$$2^K > B N^{2n}, \quad \text{i.e.,} \quad K > 2n \log N + B_1,$$

where B and $B_1 = \log B$ are other numbers depending on n (but not on N),
then there does exist a code that enables us to correct any number of
transmission errors not exceeding n in a block of N signals. In the case of
Bose-Chaudhuri-Hocquenghem codes with $N = 2^{K_1} - 1$ (so that
$K_1 \approx \log N$) the number K of check signals, as indicated above, does not
exceed $nK_1 \approx n \log N$; hence for large values of N the number of check
signals in these codes is always close to the corresponding Hamming lower bound.
In this sense, these codes are close to the best possible codes with regard to
their capability to correct a given fixed number of errors in very lengthy
blocks.
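
The bounds compared here are also easy to evaluate exactly for given N and n.
The following Python sketch is ours and makes two assumptions that should be
stated plainly: the Hamming inequality is taken in the form
$2^K \geq 1 + \binom{N}{1} + \ldots + \binom{N}{n}$, and the Varshamov-Gilbert
inequality in the form with binomial coefficients $\binom{N-1}{j}$,
$j \leq 2n - 1$, obtained by the column-selection argument. The function names
are hypothetical.

```python
from math import comb

def hamming_lower_bound(N, n):
    """Smallest K with 2**K >= 1 + C(N,1) + ... + C(N,n)."""
    total = sum(comb(N, j) for j in range(n + 1))
    return (total - 1).bit_length()

def varshamov_gilbert_sufficient(N, n):
    """Smallest K with 2**K > 1 + C(N-1,1) + ... + C(N-1,2n-1)."""
    total = sum(comb(N - 1, j) for j in range(2 * n))
    return total.bit_length()

def bch_rows(N, n):
    """Number n*K_1 of rows of the check matrix for N = 2**K_1 - 1; the number
    of check signals of the code does not exceed this value."""
    K1 = (N + 1).bit_length() - 1
    return n * K1
```

For N = 15 these functions return 7 and 9 for n = 2, and 10 and 12 for n = 3,
in agreement with the figures quoted earlier in this section.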

Obviously, the choice of quite lengthy code words (i.e., extremely large N) is
not advantageous if the codes correct only a fixed number n of errors, since
with increasing N the probability of the emergence of more than n errors in a
block of length N sharply increases. Hence, when N increases it is natural for
the value of n also to increase simultaneously. However, if n increases
proportionally to N, then with the increase of N, as has been shown, the
information transmission rate decreases at the same time (see, for example,
[212], Chap. 9). The most important problem, however, is not that of the optimal
choice of the values of N and n but that of the method of decoding the obtained
codes when N is large; specifically, the difficulty in decoding is the foremost
constraint that restricts the opportunities for the choice of code parameters
that will ensure both a low probability of error and a high transmission rate.
In relation to Bose-Chaudhuri-Hocquenghem codes a whole series of special
decoding methods have been developed that allow one to accomplish it effectively
up to a length N of code words of the order of many hundreds or even a few
thousands. It is, however, not possible here to dwell upon these methods; in
this connection, we can only refer the reader to other (sufficiently advanced)
works on information theory and coding referred to on pp. 305-306 (see also
[198] and [207]). Several other interesting and practically useful codes have
also been described in these works, but these have not been considered in the
present text.


As in the foregoing, we shall consider only the case of a binary communication channel (using 
two elementary signals), and a code shall be understood as some collection of code words, 
i.e., sequences $a = (a_0, a_1, \ldots, a_{N-1})$ of N digits 0 and 1. In the study of 
error-correcting codes, an important role is played by the Hamming distance $|b - a|_H$ between 
two sequences $b = (b_0, b_1, \ldots, b_{N-1})$ and $a = (a_0, a_1, \ldots, a_{N-1})$, which by 
definition is equal to the number of digits $a_i$ such that $b_i \neq a_i$ (i.e., the number of 
1's in the difference $b - a$, understood in the sense of 2-arithmetic). The Hamming distance 
shares many characteristics of the usual geometric distance (see, for example, Appendix II). It 
coincides with the number of distortions of individual transmitted signals that cause the 
transmitted sequence a to be received as the sequence b at the channel output. Clearly, the 
larger the Hamming distance between individual code words, the smaller is the probability of 
confusing them at the receiving end, i.e., other conditions remaining the same, the better is 
the code. Hence an important characteristic of a code is the code distance 

$$D = \min_{i \neq j} |a^{(i)} - a^{(j)}|_H$$

associated with it, the Hamming distance between the ‘closest’ distinct code words of a given 
code. It is apparent that in the case of a code that allows us to correct any number of errors 
not exceeding n, the Hamming distance between all pairs of code words $a^{(i)}$ and $a^{(j)}$ 
must be greater than 2n (see p. 309). This implies that $D \ge 2n + 1$ here, D being the code 
distance of our code. Conversely, if $D \ge 2n + 1$, then by agreeing to decode as code word 
$a^{(i)}$ any received sequence b belonging to the Hamming sphere of radius n with centre 
$a^{(i)}$ (i.e., all b such that $|b - a^{(i)}|_H \le n$), we are sure to correct any number of 
transmission errors not greater than n. Thus, a code is capable of correcting any number of 
transmission errors not greater than n if and only if its code distance D is not less than 
2n + 1. Similarly, it is easy to show that if the code distance D is not less than 2n, then the 
code allows us to correct any number of errors not exceeding n - 1 and, in addition, to detect 
the occurrence of n errors (but in the latter case it may not be possible to correct these n 
errors).† 
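
The definitions above are easy to try out numerically. The following Python sketch is an illustration only and is not taken from the text: the function names and the four-word code `toy_code` are assumptions made for the example. It computes the Hamming distance, the code distance D, and the number n of errors guaranteed correctable by the rule $D \ge 2n + 1$, and decodes a received word to the nearest code word.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions in which two equal-length binary words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def code_distance(code):
    """Smallest Hamming distance between two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


def correctable_errors(code):
    """Largest n satisfying D >= 2n + 1, i.e. n = (D - 1) // 2."""
    return (code_distance(code) - 1) // 2


def decode_nearest(code, received):
    """Decode to a code word at the smallest Hamming distance from `received`."""
    return min(code, key=lambda word: hamming_distance(word, received))


# Hypothetical example code with four words of length 5 (not from the text).
toy_code = ["00000", "01011", "10101", "11110"]

print(code_distance(toy_code))            # code distance D
print(correctable_errors(toy_code))       # guaranteed number of correctable errors
print(decode_nearest(toy_code, "01111"))  # "01011" with one distorted digit
```

For this particular code the distance is 3, so every single error is corrected, in agreement with the criterion stated above.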

†It ought to be borne in mind that the code distance D does not define the total capability 
of the code to correct transmission errors. Thus, say, if D = 2n, then frequently for many 
(but not for all) transmitted words $a^{(i)}$ the code nonetheless permits us to correct 
transmission errors even when their number considerably exceeds n. 

It is clear that the ‘volume’ $V_n$ of the Hamming sphere of radius n, i.e., the number of 
‘points’ $b = (b_0, b_1, \ldots, b_{N-1})$ belonging to this sphere with centre at an arbitrary 
‘point’ $a = (a_0, a_1, \ldots, a_{N-1})$, is defined by the formula 

$$V_n = 1 + \binom{N}{1} + \binom{N}{2} + \ldots + \binom{N}{n}.$$

Since the total number of all N-term sequences is $2^N$, it is immediate that the number S of 
distinct code words of length N appearing in a code that allows us to correct any number 
of errors not exceeding n must satisfy the condition 

$$S \le \frac{2^N}{1 + \binom{N}{1} + \ldots + \binom{N}{n}}. \qquad (8)$$

This simple condition, giving an upper bound on the possible number S of code words (and 
hence also the maximum possible information transmission rate v = (L/N) log S bits/unit 
time), is called the Hamming upper bound on the number of code words. In the particular case 
of parity-check codes (otherwise known as linear or group codes), it coincides with the Hamming 
lower bound on the number of check signals considered on p. 315: in fact, for an (N, M)- 
parity-check code the number S of code words is given by 

$$S = 2^M = \frac{2^N}{2^K},$$



and hence condition (8) coincides here exactly with the Hamming inequality. Note, however, 
that condition (8), in contrast to the Hamming inequality for the number K, applies to any 
code and not to parity-check codes alone. 
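
As a small numerical illustration of condition (8), the sketch below evaluates the sphere volume $V_n$ and the resulting bound on S. It is not part of the original text; the function names are assumptions made for the example.

```python
from math import comb


def sphere_volume(N, n):
    """V_n = 1 + C(N,1) + ... + C(N,n): points within Hamming distance n of a centre."""
    return sum(comb(N, k) for k in range(n + 1))


def hamming_upper_bound(N, n):
    """Largest integer S allowed by condition (8) for length N and n correctable errors."""
    return 2**N // sphere_volume(N, n)


def meets_bound_with_equality(N, S, n):
    """True when S spheres of radius n contain exactly 2^N points in total."""
    return S * sphere_volume(N, n) == 2**N


print(hamming_upper_bound(7, 1))                 # 128 // 8
print(meets_bound_with_equality(7, 16, 1))       # the (7, 4) Hamming code
print(meets_bound_with_equality(15, 2**11, 1))   # the (15, 11) Hamming code
print(hamming_upper_bound(15, 2))                # bound for double-error correction, N = 15
```

The integer division is legitimate because S is a whole number, so S cannot exceed the integer part of the right-hand side of (8).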

A code having the property that the left-hand and right-hand sides of (8) coincide with 
each other is called a perfect code (or, sometimes, a densely packed code). Perfect codes are 
remarkable because in practically all respects they are optimal (i.e., the best). It is seen, for 
example, that among codes of a given length N, correcting a given number n of errors, the 
largest number S of code words (i.e., the largest information transmission rate) corresponds to 
perfect codes. Moreover, in the case of perfect parity-check codes correcting a specified 
number of errors, the number of check signals K is the least possible. Now assume that our code 
is employed for the transmission of information over a binary symmetric channel; here an 
extremely important characteristic of the quality of transmission is the mean probability of 
decoding error 


$$Q = \frac{Q_1 + Q_2 + \ldots + Q_S}{S},$$

where S is the total number of code words of the code and $Q_i$ is the probability that the 
transmitted ith code word $a^{(i)}$ will be decoded in error at the channel output. Now suppose 
that $m_k^{(i)}$ is the number of sequences b which are at Hamming distance k from the ith code 
word $a^{(i)}$ and are to be decoded as $a^{(i)}$ at the channel output. Since in the case of the 
transmission of a sequence $a^{(i)}$ through a binary symmetric channel the probability of 
receiving any such sequence b is $p^k(1-p)^{N-k}$, the probability of decoding accurately the 
transmitted sequence $a^{(i)}$ is the sum 

$$m_0^{(i)}(1-p)^N + m_1^{(i)}p(1-p)^{N-1} + \ldots + m_k^{(i)}p^k(1-p)^{N-k} + \ldots$$

Hence, it is seen that the mean probability of decoding error is 

$$Q = 1 - \frac{1}{S}\left[m_0(1-p)^N + m_1p(1-p)^{N-1} + \ldots + m_kp^k(1-p)^{N-k} + \ldots\right],$$

where $m_k = m_k^{(1)} + m_k^{(2)} + \ldots + m_k^{(S)}$ is the total number of sequences b at the 
Hamming distance k away from some code word $a^{(i)}$ and to be decoded as this $a^{(i)}$ (so that 

$$m_0 + m_1 + \ldots + m_k + \ldots = 2^N).$$

But the total number of sequences of length N at a given Hamming distance k from a fixed 
sequence $a^{(i)}$ is $\binom{N}{k}$. Hence, for a code consisting of S code words of length N, 
we have 

$$m_0 \le S,\quad m_1 \le S\binom{N}{1},\quad \ldots,\quad m_k \le S\binom{N}{k},\quad \ldots$$

Suppose now that n is the largest integer such that 

$$S + S\binom{N}{1} + \ldots + S\binom{N}{n} \le 2^N,$$

but 

$$S + S\binom{N}{1} + \ldots + S\binom{N}{n} + S\binom{N}{n+1} > 2^N,$$

so that 

$$2^N - \left(S + S\binom{N}{1} + \ldots + S\binom{N}{n}\right) = T < S\binom{N}{n+1}.$$

Then, if $m_0 = S$, $m_1 = S\binom{N}{1}$, ..., $m_n = S\binom{N}{n}$, we have $m_{n+1} \le T$. As 
usual, it is assumed here that $p < 1/2$; then the probability $p^k(1-p)^{N-k}$ decreases with 
the increase of k and hence the case in which $m_0 = S$, $m_1 = S\binom{N}{1}$, ..., 
$m_n = S\binom{N}{n}$, $m_{n+1} = T$ is the most favourable, i.e., gives the least mean error Q. 
Consequently, 

$$Q \ge 1 - \left[(1-p)^N + \binom{N}{1}p(1-p)^{N-1} + \ldots + \binom{N}{n}p^n(1-p)^{N-n} + \frac{T}{S}\,p^{n+1}(1-p)^{N-n-1}\right]. \qquad (9)$$


The estimate (9) of the least possible mean probability of decoding error for a code with 
fixed values of N and S (used for transmission over a binary symmetric channel with a given 
value of the probability p of the distortion of signals) is called the Hamming lower bound on 
the mean probability of error. For perfect codes with the decoding rule stating that all received 
N-term sequences separated from a code word $a^{(i)}$ by a Hamming distance not exceeding n 
are decoded as $a^{(i)}$, inequality (9) obviously turns into an equality (T being equal to 0 
here). Hence, it is seen that for such codes the mean probability of error is the smallest 
attainable by any code with the same values of N and S. 
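
The right-hand side of (9) is straightforward to evaluate. The sketch below is an illustration and not part of the original text; the function name and the sample parameters are assumptions. It finds n and T from N and S exactly as in the derivation above and then returns the bound.

```python
from math import comb


def hamming_error_lower_bound(N, S, p):
    """Return (n, T, bound), where bound is the right-hand side of inequality (9)."""
    if not (1 <= S <= 2**N):
        raise ValueError("need 1 <= S <= 2**N")
    if not (0 <= p < 0.5):
        raise ValueError("the derivation assumes p < 1/2")
    covered = S  # S * C(N, 0): the code words themselves
    n = 0
    # Largest n with S * (1 + C(N,1) + ... + C(N,n)) <= 2^N.
    while n < N and covered + S * comb(N, n + 1) <= 2**N:
        n += 1
        covered += S * comb(N, n)
    T = 2**N - covered  # points left over outside the spheres of radius n
    correct = sum(comb(N, k) * p**k * (1 - p)**(N - k) for k in range(n + 1))
    if n < N:
        correct += (T / S) * p**(n + 1) * (1 - p)**(N - n - 1)
    return n, T, 1 - correct


# A perfect code: N = 7, S = 16 (the (7, 4) Hamming code), so T = 0.
print(hamming_error_lower_bound(7, 16, 0.01))
# A non-perfect choice of parameters: N = 5, S = 4 leaves T = 8 points over.
print(hamming_error_lower_bound(5, 4, 0.1))
```

For N = 7, S = 16 the bound reduces to $1 - (1-p)^7 - 7p(1-p)^6$, the probability of two or more distortions in a block of seven signals.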

Perfect codes have an extremely simple geometric meaning (in the geometry that uses the 
Hamming distance instead of the usual distance): they correspond to the cases in which the 
collection of all possible ‘points’ $b = (b_0, b_1, \ldots, b_{N-1})$ can be partitioned into a 
finite number of ‘Hamming spheres’ of a certain radius n, mutually disjoint but filling in their 
totality the entire ‘space’ (consisting of $2^N$ points), and the centres of these ‘spheres’ are 
taken as code words (hence the name ‘densely packed code’). Their main deficiency is that only a 
few such codes are available, existing only for certain exceptional values of N and S. The 
simplest perfect code is a trivial code consisting of two code words (0, 0, ..., 0) and 
(1, 1, ..., 1), each of which is composed of an odd number N = 2n + 1 of the same digits. For 
such a code, obviously D = 2n + 1, and the code allows one to correct n or fewer errors; here the 
whole space of $2^N = 2^{2n+1}$ points is decomposed into two Hamming spheres of radius n (each 
containing $2^{2n} = 2^{N-1}$ points). In addition to this, there is a non-trivial (and highly 
important) class of perfect codes formed by the Hamming codes with $N = 2^K - 1$, 
$M = 2^K - K - 1$. In fact, it has already been remarked on p. 315 that the Hamming inequality 
for the number of ‘check signals’ (which is equivalent to inequality (8)) becomes an equality for 
these codes; hence they are perfect. In the case of Hamming perfect codes, the whole space of 
$2^N = 2^{2^K - 1}$ points is decomposed into $2^{2^K - K - 1}$ Hamming spheres of radius 1, each 
of which contains $2^K$ points; here D = 3 and, consequently, all single errors can be corrected. 
But, if it is just assumed that n > 1 and S > 2, then we immediately encounter the foremost 
difficulty that for the existence of a perfect code the sum 
$1 + \binom{N}{1} + \ldots + \binom{N}{n}$ by (8) must be equal to some integral power of 2, 
which in reality is seldom achieved. The American scientist M. J. E. Golay, in his search for 
perfect codes, noticed that 

$$1 + \binom{23}{1} + \binom{23}{2} + \binom{23}{3} = 2048 = 2^{11},$$

and this suggested that in principle there may exist a perfect code with 

$$N = 23 \quad\text{and}\quad S = \frac{2^{23}}{2^{11}} = 2^{12} = 4096,$$

capable of correcting any combination of three or fewer errors. He indeed succeeded in finding 
such a code (since called the Golay perfect binary code). The code turns out to be a (23, 12)- 
cyclic parity-check code, defined by the generator polynomial 

$$g(x) = x^{11} + x^9 + x^7 + x^6 + x^5 + x + 1,$$

or by the check polynomial 

$$h(x) = \frac{x^{23} - 1}{g(x)} = x^{12} + x^{10} + x^7 + x^4 + x^3 + x^2 + x + 1,$$


and having code distance D = 7.† Subsequent searches for new perfect codes have not been 
fruitful; except for those enumerated above, no other such codes have been found.‡ This 
led many scientists to infer that there are no perfect codes except those enumerated 
above. Attempts to prove rigorously that there are no new perfect codes succeeded in the early 
seventies due to Tietäväinen and Perko [217] in Finland and Zinoviev and Leontiev [220] in the 
USSR. The initial results of these authors were related to perfect binary codes (i.e., codes of 
messages represented by sequences of two elementary signals). However, later Zinoviev and 
Leontiev [221] and Tietäväinen [216] independently obtained a complete solution of the problem 
of finding all perfect codes that employ $p^k$ elementary signals, where p is an arbitrary prime 
and k is any positive integer. It turns out that such general perfect codes are quite rare, too. 

†The Golay code turned out to coincide also with the (nonprimitive) Bose-Chaudhuri- 
Hocquenghem code corresponding to the values N = 23 and n = 2 (i.e., correcting all single 
and double errors). However, the construction of this code by the method due to Bose et al. 
allows one to assert only that for this code $D \ge 5$ (which means precisely that it is a 
double-error-correcting code), whereas Golay claimed that in fact here D = 7. 

‡This, of course, does not imply that there do not exist other sums of the form 

$$1 + \binom{N}{1} + \ldots + \binom{N}{n}$$

equal to a power of two. Thus, for instance, it is easy to verify that 

$$1 + \binom{90}{1} + \binom{90}{2} = 2^{12},$$

but it can nevertheless be proved that there exists no perfect code with N = 90 and n = 2. 
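
The arithmetic facts quoted for the Golay code can be verified mechanically. The sketch below is an illustration and not part of the original text; the helper names and the representation of polynomials over {0, 1} as integer bit masks (bit e stands for $x^e$) are assumptions made for the example. It checks the two binomial sums, the identity $g(x)h(x) = x^{23} - 1$ in 2-arithmetic, and, by enumerating all $2^{12} - 1$ nonzero code polynomials c(x)g(x), the smallest number of nonzero coefficients, which for a parity-check code equals the code distance.

```python
from math import comb


def gf2_mul(a, b):
    """Multiply two polynomials over {0, 1}; bit e of a mask is the coefficient of x^e."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition of coefficients modulo 2
        a <<= 1
        b >>= 1
    return result


def poly(*exponents):
    """Bit mask of the polynomial that is the sum of x^e over the given exponents."""
    mask = 0
    for e in exponents:
        mask ^= 1 << e
    return mask


g = poly(11, 9, 7, 6, 5, 1, 0)         # generator polynomial of the Golay code
h = poly(12, 10, 7, 4, 3, 2, 1, 0)     # check polynomial

sum_23 = sum(comb(23, k) for k in range(4))   # 1 + C(23,1) + C(23,2) + C(23,3)
sum_90 = sum(comb(90, k) for k in range(3))   # 1 + C(90,1) + C(90,2)
product_ok = gf2_mul(g, h) == poly(23, 0)     # x^23 + 1, i.e. x^23 - 1 modulo 2

# Smallest weight of a nonzero code polynomial c(x) g(x) with deg c(x) < 12.
min_weight = min(bin(gf2_mul(c, g)).count("1") for c in range(1, 2**12))

print(sum_23, sum_90, product_ok, min_weight)
```

The enumeration is feasible only because the code is small; it says nothing about how Golay found the code.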

Since perfect codes are so scarce, much attention has been devoted to the search for so- 
called quasi-perfect codes, slightly inferior to perfect codes but nevertheless sufficiently good. 
Quasi-perfect codes are defined as codes such that the Hamming spheres of a certain fixed radius 
n with centres at the points corresponding to all possible code words fill out the entire space 
of $2^N$ points b, with the exception of only some $T < S\binom{N}{n+1}$ points (where S is the 
number of code words of the code) located at Hamming distance n + 1 from at least one (but maybe 
also from several) code words. If we agree in the case of quasi-perfect codes to decode as 
$a^{(i)}$ all the received sequences b separated from the code word $a^{(i)}$ by a Hamming 
distance of not more than n, and to decode a sequence b separated by a distance n + 1 from the 
code word closest to it as one (it is immaterial which) of the code words separated from b by a 
distance n + 1, then inequality (9) also becomes an equality here. Hence, even for quasi-perfect 
codes used for transmission over a binary symmetric channel, the mean probability of decoding 
error is no greater than for any other code with the same values of N and S. At the same time, 
quasi-perfect codes exist in greater number than perfect codes (even though they are not many). 
Thus, for instance, the codes correcting all single errors in a block of $N < 2^K - 1$ digits 
and obtained by omitting a certain number of columns in a check matrix corresponding to the 
Hamming perfect code with $N = 2^K - 1$ quite often turn out to be quasi-perfect (see, for 
example, [202], Chapter 5). The (primitive) double-error-correcting Bose-Chaudhuri-Hocquenghem 
codes with $N = 2^K - 1$, considered on pp. 334-336, are also all quasi-perfect (see, e.g., 
[202]); it is specifically on this basis that such codes were affirmed on p. 336 to be 
necessarily optimal. A series of other examples of quasi-perfect codes is described in 
Chapter 5 of [212]; we shall not further elaborate on this here. 


Finally, let us describe the general construction of the binary Bose-Chaudhuri-Hocquenghem 
codes that have been mentioned repeatedly in this section. The basis of this construction is 
an ingenious description of the code generator polynomial by determining its roots, i.e., the 
solutions of the equation g(x) = 0. The main difficulty in determining the roots of g(x) is easy 
to understand if we remember that the roots of an ordinary polynomial with real coefficients 
need not be real numbers, but may belong to the wider field of complex numbers (i.e., a field 
containing the field of real numbers as a part). Quite similarly, the roots of a polynomial 
g(x) with coefficients from a finite field may belong to an extension of the given field, i.e., to 
a new field containing the original finite field as a part. In particular, when the coefficients 
of g(x) are elements of 2-arithmetic (a field $F_2 = \{0, 1\}$ with two elements 0 and 1), the 
roots of g(x) may belong to a finite field $F_{2^m}$ with $2^m$ distinct elements, where m > 1; 
see Appendix II. (As explained in Appendix II, the field $F_{2^m}$ is nothing but the collection 
of all polynomials $a_0 + a_1\beta + \ldots + a_{m-1}\beta^{m-1}$, where 
$a_0, a_1, \ldots, a_{m-1}$ are elements of $F_2 = \{0, 1\}$ and $\beta$ is a root of an 
irreducible polynomial $P_m(x)$ of degree m with all coefficients equal to 0 or 1. Another 
equivalent representation of $F_{2^m}$ is given by $P_m(x)$-arithmetic, i.e., the field of all 
the remainders after division of arbitrary polynomials by $P_m(x)$.) 

Our problem is to find a generator polynomial g(x) such that any pair of code polynomials 
$a_1(x)$ and $a_2(x)$ of the corresponding polynomial generator code has more than d distinct 
coefficients, where d = 2n is a given integer (and n is the maximal number of errors to be 
corrected in a code word; see p. 309). Let us first choose an integer r such that 
$2^r - 1 > d$. Consider a finite field $F_{2^r}$ of order $2^r$ constructed with the aid of an 
irreducible polynomial $P_r(x)$ of degree r with coefficients from $F_2 = \{0, 1\}$. Let 
$\alpha$ be a primitive element of $F_{2^r}$, i.e., let all the consecutive powers 
$\alpha^1 = \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2^r - 1} = 1$ be distinct (see Appendix 
II, where it is also indicated that any r + 1 elements of the field $F_{2^r}$ are linearly 
dependent, i.e., the sum of some of these r + 1 elements is equal to zero; recall that all 
coefficients of equation (7) of Appendix II are equal to 0 or 1 in our case). Let us consider 
the collection of elements 


$$\alpha^0 = 1,\ \alpha^1 = \alpha,\ \alpha^2,\ \ldots,\ \alpha^{n_1},$$

where $n_1$ is chosen subject to the condition that the elements 
$1, \alpha, \alpha^2, \ldots, \alpha^{n_1}$ are linearly dependent but the elements 
$1, \alpha, \alpha^2, \ldots, \alpha^{n_1 - 1}$ are not (it is clear that necessarily 
$n_1 < r + 1$). The corresponding linear dependence has the form 

$$\alpha^{i_1^{(1)}} + \alpha^{i_2^{(1)}} + \ldots + \alpha^{i_s^{(1)}} = 0, \qquad (10)$$

where 

$$0 \le i_1^{(1)} < i_2^{(1)} < \ldots < i_s^{(1)} = n_1 < r + 1$$

are certain integers. Here $x^{i_1^{(1)}} + x^{i_2^{(1)}} + \ldots + x^{i_s^{(1)}} = 0$ is an 
equation of the lowest degree with coefficients from $F_2$ having the root $\alpha$. 
Accordingly, the polynomial 

$$M_1(x) = x^{i_1^{(1)}} + x^{i_2^{(1)}} + \ldots + x^{i_s^{(1)}}$$

may be called the minimal polynomial of the element $\alpha$. 

Consider now the sequence of consecutive powers of $\alpha^2$ (i.e., 
$1, \alpha^2, \alpha^4, \alpha^6, \ldots$) and let $n_2$ be the smallest number such that the 
terms $1, \alpha^2, \ldots, (\alpha^2)^{n_2}$ of this sequence are linearly dependent. Then 

$$(\alpha^2)^{i_1^{(2)}} + (\alpha^2)^{i_2^{(2)}} + \ldots + (\alpha^2)^{i_s^{(2)}} = 0, \qquad (11)$$

where $i_s^{(2)} = n_2$. Here 
$M_2(x) = x^{i_1^{(2)}} + x^{i_2^{(2)}} + \ldots + x^{i_s^{(2)}} = 0$ is evidently the equation 
of the lowest degree having the root $\alpha^2$ and hence $M_2(x)$ is the minimal polynomial of 
the element $\alpha^2$. (It will be shown later that $M_2(x)$ coincides with $M_1(x)$; however, 
this fact is immaterial here.) Similarly, we may consider the sequence 
$1, \alpha^3, \alpha^6, \ldots$ and form an equation of the lowest degree having the root 
$\alpha^3$ and the corresponding minimal polynomial $M_3(x)$. We continue to apply this 
procedure to $\alpha^4, \alpha^5, \ldots, \alpha^d$. 

Let us now consider the polynomial 

$$g(x) = \operatorname{lcm}\,[M_1(x), M_2(x), \ldots, M_d(x)], \qquad (12)$$


where lcm symbolizes the least common multiple of the polynomials in square brackets. It 
is possible to show that if $N = 2^r - 1$, then g(x) is just the desired generator polynomial. 
For this it is necessary to show that, if g(x) is given by (12), then any two polynomials of the 
form (6) (see p. 329) of degree less than $2^r - 1$ will have at least d + 1 distinct 
coefficients. Since the difference of two polynomials of the form (6) has also the form (6), it 
is sufficient to show that any polynomial a(x) = c(x)g(x) of degree less than $2^r - 1$ has at 
least d + 1 non-zero (i.e., equal to 1) coefficients. 

The proof of this last statement is rather easy. It is clear that the polynomial g(x) has 
the roots $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^d$ and that it is the polynomial of the 
lowest degree having all these roots (since $M_1(x), M_2(x), \ldots, M_d(x)$ are the polynomials 
of the lowest degree having the roots $\alpha, \alpha^2, \ldots, \alpha^d$, respectively). Hence 
any code polynomial a(x) = c(x)g(x) has also the roots $\alpha, \alpha^2, \ldots, \alpha^d$. 
Suppose now that 

$$a(x) = c(x)g(x) = x^{j_1} + x^{j_2} + \ldots + x^{j_s}, \qquad (13)$$

where the number s of terms in the right-hand side is not greater than d. Let us substitute the 
roots $\alpha, \alpha^2, \ldots, \alpha^d$ in the right-hand side of (13) in place of x. Since 
$(\alpha^t)^{j_m} = (\alpha^{j_m})^t$, we obtain a system of relations 

$$\begin{aligned}
\alpha^{j_1} + \alpha^{j_2} + \ldots + \alpha^{j_s} &= 0,\\
(\alpha^{j_1})^2 + (\alpha^{j_2})^2 + \ldots + (\alpha^{j_s})^2 &= 0,\\
&\ \,\vdots\\
(\alpha^{j_1})^d + (\alpha^{j_2})^d + \ldots + (\alpha^{j_s})^d &= 0.
\end{aligned} \qquad (14)$$

Here $0 \le j_1 < j_2 < \ldots < j_s < 2^r - 1$ and, therefore, all the elements 
$\alpha^{j_1}, \alpha^{j_2}, \ldots, \alpha^{j_s}$ are distinct (since $\alpha$ is a primitive 
element of the field $F_{2^r}$). However, $s \le d$ and, therefore, already the first s of 
relations (14) are contradictory (see Appendix II, p. 374). This proves the statement 
formulated above. 

Let us also note that $\alpha, \alpha^2, \ldots, \alpha^d$ are nonzero elements of the field 
$F_{2^r}$ of $2^r$ elements. Therefore, the order of all these elements is a divisor of 
$N = 2^r - 1$ (see Appendix II). This implies that all elements 
$\alpha, \alpha^2, \ldots, \alpha^d$ are roots of the equation $x^N - 1 = 0$. Since 
$M_1(x), M_2(x), \ldots, M_d(x)$ are polynomials of the lowest degree having the roots 
$\alpha, \alpha^2, \ldots, \alpha^d$, respectively, it is easy to show that the polynomial (12) 
is a divisor of the polynomial $x^N - 1$, $N = 2^r - 1$. Hence, the (N, M)-code generated by 
g(x), where M = N - K and K is the degree of the polynomial g(x), is a cyclic code. This code is 
just the Bose-Chaudhuri-Hocquenghem (N, M)-code. 

Now let us consider a few examples. Suppose that we are looking for a code that corrects 
all the single and all the double errors. Here any two of the corresponding code polynomials 
must differ in five or more coefficients, i.e., d = 4. Therefore, we can select r = 3 and start 
from the field $F_8$ of $2^3 = 8$ elements. Such a field is given, for example, by 
$(x^3 + x + 1)$-arithmetic, i.e., by the collection of all the remainders from division of 
polynomials with coefficients from $F_2 = \{0, 1\}$ by $Q(x) = x^3 + x + 1$, or (what is 
equivalent) by the collection of the trinomials $a_0 + a_1\alpha + a_2\alpha^2$, where 
$a_0$, $a_1$ and $a_2$ are elements of $F_2$ and $\alpha$ is a root of the equation 
$x^3 + x + 1 = 0$. (Note that the polynomial $Q(x) = x^3 + x + 1$ is clearly irreducible over 
$F_2$, since neither element of $F_2$ is a root of it and, therefore, no first-degree polynomial 
over $F_2$ is a factor of Q(x); a reducible polynomial of the third degree would necessarily 
have such a factor.) 

Consider $F_8$ as the collection of the trinomials $a_0 + a_1\alpha + a_2\alpha^2$. The element 
$\alpha$ of this field is here a primitive element, since evidently 

$$\alpha,\ \alpha^2,\ \alpha^3 = \alpha + 1,\ \alpha^4 = \alpha^2 + \alpha,\ \alpha^5 = \alpha^2 + \alpha + 1,\ \alpha^6 = \alpha^2 + 1,\ \alpha^7 = 1 \qquad (15)$$

are all the nonzero elements of this field. (Equations (15), of course, make use of the identity 
$\alpha^3 + \alpha + 1 = 0$ and the laws of 2-arithmetic.) The minimal polynomial of $\alpha$ is 
equal to $M_1(x) = x^3 + x + 1$ by the very definition of $\alpha$. Moreover, 
$(a + b + c)^2 = a^2 + b^2 + c^2$ over the field $F_2$ (see Appendix II). Therefore, 

$$(\alpha^3 + \alpha + 1)^2 = (\alpha^3)^2 + \alpha^2 + 1 = (\alpha^2)^3 + \alpha^2 + 1,$$

and hence $M_1(\alpha^2) = 0$, i.e., the minimal polynomial $M_2(x)$ of $\alpha^2$ coincides with 
$M_1(x)$. Quite similarly we can show that $[M_2(\alpha^2)]^2 = M_2(\alpha^4)$ and hence 
$M_4(x) = M_2(x)$.† 

Equations (15) imply, in particular, that $\alpha^9 = \alpha^2$ and that 
$\alpha^6 + \alpha^2 + 1 = 0$. Therefore, $(\alpha^3)^3 + (\alpha^3)^2 + 1 = 0$ and hence 
$M_3(x) = x^3 + x^2 + 1$ (it is easy to show that 1, $\alpha^3$ and $\alpha^6$ are linearly 
independent). 

Now, we obtain 

$$g(x) = \operatorname{lcm}\,[M_1(x), M_2(x), M_3(x), M_4(x)] = M_1(x)M_3(x) = (x^3 + x + 1)(x^3 + x^2 + 1) = x^6 + x^5 + x^4 + x^3 + x^2 + x + 1.$$


Since $N = 2^3 - 1 = 7$ in our case, this generator polynomial corresponds to a very simple 
check polynomial 

$$h(x) = \frac{x^7 - 1}{g(x)} = x + 1.$$

In the considered case N = 7 and the generator polynomial g(x) is of the sixth degree. Hence 
we obtain a (not very advantageous!) double-error-correcting code with code words containing 
one information signal and six check signals. However, the general Varshamov-Gilbert 
inequality (****) on p. 327 shows that the number of check signals cannot be decreased here. 
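
The computations in this example are small enough to check by machine. The sketch below is an illustration and not part of the original text. As an assumption made for the example, an element $a_0 + a_1\alpha + a_2\alpha^2$ of $F_8$ is stored as the 3-bit integer with bits $a_2a_1a_0$, and a polynomial over $F_2$ is stored as a bit mask in the same way. The code reproduces the table (15), confirms that $\alpha$, $\alpha^2$, $\alpha^4$ are roots of $x^3 + x + 1$ and $\alpha^3$ is a root of $x^3 + x^2 + 1$, and multiplies the two minimal polynomials.

```python
MODULUS = 0b1011  # Q(x) = x^3 + x + 1


def times_alpha(element):
    """Multiply an element of F_8 by alpha, using alpha^3 = alpha + 1."""
    element <<= 1
    if element & 0b1000:
        element ^= MODULUS
    return element


# powers[k] is alpha^k written as a 3-bit mask; powers[7] closes the cycle.
powers = [1]
for _ in range(7):
    powers.append(times_alpha(powers[-1]))
log = {powers[k]: k for k in range(7)}


def field_mul(a, b):
    """Product of two elements of F_8 via the table of powers of alpha."""
    if a == 0 or b == 0:
        return 0
    return powers[(log[a] + log[b]) % 7]


def evaluate(poly_mask, x):
    """Value at x (an element of F_8) of a polynomial with 0/1 coefficients (Horner)."""
    result = 0
    for bit in reversed(range(poly_mask.bit_length())):
        result = field_mul(result, x)
        if (poly_mask >> bit) & 1:
            result ^= 1
    return result


def gf2_mul(a, b):
    """Product of two polynomials over F_2 given as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


M1, M3 = 0b1011, 0b1101   # x^3 + x + 1 and x^3 + x^2 + 1
g = gf2_mul(M1, M3)
print([format(p, "03b") for p in powers])  # alpha^0 ... alpha^7 as bits a2 a1 a0
print(format(g, "b"))                      # coefficients of g(x), highest degree first
```

All seven powers $\alpha^0, \ldots, \alpha^6$ come out distinct and $\alpha^7 = 1$, which is exactly the statement that $\alpha$ is a primitive element.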

The simplest method to improve upon the proportion of information signals is to increase 
the value of r. Let us assume that r = 4 and hence $N = 2^4 - 1 = 15$. The corresponding 
Hamming code correcting all single errors is a (15, 11)-code determined by the check matrix 
written on p. 325.‡ In the case of a double-error-correcting code we must consider a field 
$F_{16}$ of $2^4 = 16$ elements determined by an irreducible fourth-degree polynomial Q(x) with 
coefficients from $F_2 = \{0, 1\}$. Let us choose $Q(x) = x^4 + x + 1$. It is easy to verify 
that this Q(x) is irreducible over $F_2$ and a root $\alpha$ of Q(x) is a primitive element of 
$F_{16}$ represented by the collection of all quadrinomials 
$a_0 + a_1\alpha + a_2\alpha^2 + a_3\alpha^3$. In analogy to the above example, we can also 
show that here 

$$M_1(x) = M_2(x) = M_4(x) = x^4 + x + 1,$$
$$M_3(x) = x^4 + x^3 + x^2 + x + 1.$$

Hence we obtain 

$$g(x) = (x^4 + x + 1)(x^4 + x^3 + x^2 + x + 1).$$

This is the first generator polynomial written on p. 335. 


†It is clear that this derivation is a general one: the minimal polynomials over a binary 
field $F_2 = \{0, 1\}$ always satisfy the relations $M_1(x) = M_2(x) = M_4(x) = M_8(x) = \ldots$, 
$M_3(x) = M_6(x) = M_{12}(x) = \ldots$, $M_5(x) = M_{10}(x) = \ldots$, and so on. These relations 
imply, in particular, that the inequality $K \le dr$ (where $N = 2^r - 1$) implied by (12) can be 
replaced by a stronger inequality $K \le dr/2$. 

‡The single-error-correcting Hamming codes are simultaneously the Bose-Chaudhuri- 
Hocquenghem codes corresponding to d = 2. 

For a triple-error-correcting code (correcting all the single, double and triple errors) d = 6 
and hence here also we can choose r = 4 (i.e., N = 15). We consider again the field $F_{16}$ 
determined by the irreducible polynomial $Q(x) = x^4 + x + 1$. Let a root $\alpha$ of Q(x) be 
again selected as a primitive element of $F_{16}$. As above, we obtain 

$$M_1(x) = M_2(x) = M_4(x) = x^4 + x + 1, \quad M_3(x) = M_6(x) = x^4 + x^3 + x^2 + x + 1,$$
$$M_5(x) = x^2 + x + 1.$$

Hence in this case $g(x) = (x^2 + x + 1)(x^4 + x + 1)(x^4 + x^3 + x^2 + x + 1)$. This is the 
second generator polynomial written on p. 335. 
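
The two generator polynomials for N = 15 can be recomputed by a short program. The sketch below is an illustration, not the method of the text: instead of searching for the shortest linear dependence among powers of $\alpha^i$, it uses the standard fact that the minimal polynomial of $\alpha^i$ is the product of $(x - \beta)$ over the distinct elements $\beta = \alpha^i, \alpha^{2i}, \alpha^{4i}, \ldots$; this is consistent with the relations $M_1 = M_2 = M_4 = \ldots$ noted in the footnote above. The names, the bit-mask representation and the brute-force weight check are assumptions made for the example.

```python
R = 4
N = 2**R - 1
PRIMITIVE = 0b10011  # Q(x) = x^4 + x + 1

# exp[k] is alpha^k as a 4-bit mask; log is the inverse table.
exp = [1] * N
for k in range(1, N):
    value = exp[k - 1] << 1
    if value >> R:
        value ^= PRIMITIVE
    exp[k] = value
log = {value: k for k, value in enumerate(exp)}


def field_mul(a, b):
    """Product of two elements of F_16."""
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % N]


def minimal_polynomial(i):
    """Bit mask of the minimal polynomial of alpha^i over F_2."""
    exponents, c = [], i % N
    while c not in exponents:          # exponents i, 2i, 4i, ... modulo N
        exponents.append(c)
        c = (2 * c) % N
    coeffs = [1]                       # coefficients in F_16, lowest degree first
    for c in exponents:                # multiply by (x + alpha^c)
        root = exp[c]
        new = [0] * (len(coeffs) + 1)
        for degree, a in enumerate(coeffs):
            new[degree + 1] ^= a
            new[degree] ^= field_mul(a, root)
        coeffs = new
    if any(a not in (0, 1) for a in coeffs):
        raise ArithmeticError("coefficients should lie in F_2")
    return sum(a << degree for degree, a in enumerate(coeffs))


def gf2_mul(a, b):
    """Product of two polynomials over F_2 given as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def bch_generator(d):
    """lcm of M_1, ..., M_d: the product of the distinct minimal polynomials."""
    g, seen = 1, set()
    for i in range(1, d + 1):
        m = minimal_polynomial(i)
        if m not in seen:
            seen.add(m)
            g = gf2_mul(g, m)
    return g


def min_weight(g):
    """Smallest number of nonzero coefficients among nonzero c(x) g(x), deg < N."""
    info = N - (g.bit_length() - 1)
    return min(bin(gf2_mul(c, g)).count("1") for c in range(1, 2**info))


for d in (4, 6):
    g = bch_generator(d)
    print(d, format(g, "b"), min_weight(g))
```

Because distinct minimal polynomials are irreducible and therefore have no common factors, their least common multiple is simply their product, which is what `bch_generator` computes.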

The above construction of the Bose-Chaudhuri-Hocquenghem codes can be generalized 
quite easily to the case of a non-binary communication channel employing $p^n$ elementary 
signals, where p is an arbitrary prime and n is an integer. For a study of this aspect, the 
reader is referred to books on coding and information theory mentioned on pp. 305-306; we shall 
not dwell upon it here. 


Properties of convex functions 


A function y = f(x) is said to be convex upward (or, for short, simply convex) 
on an interval from x = a to x = b if in this interval every arc MN joining two 
points of the graph of the function lies above the corresponding chord MN† 
(Fig. 31). There are numerous examples of convex functions including the 
following: the logarithmic function y = log x in its entire domain, i.e., from 0 
to $\infty$; the power function $y = -x^m$ in the same domain, where m > 1; the 
exponential function $y = -a^x$ in the domain from $-\infty$ to $+\infty$; the function 
y = -x log x in the domain from 0 to $\infty$, and the function 

y = -x log x - (1 - x) log (1 - x) 

in the domain from 0 to 1 (Fig. 32, a-e). 


Fig. 31. 

Theorem 1. If y = f(x) is a convex function on an interval from a to b and 
$x_1$, $x_2$ are two values of the argument of this function within this interval (i.e., 
two arbitrary numbers such that $a \le x_1 < x_2 \le b$), then 

$$\frac{f(x_1) + f(x_2)}{2} < f\left(\frac{x_1 + x_2}{2}\right). \qquad (1)$$

†In differential calculus it is shown that the following convexity test holds for a sufficiently 
wide class of functions (in particular, for all functions considered in this appendix): the 
function y = f(x) is convex on the interval a < x < b if its second derivative y'' is everywhere 
negative (i.e., y'' < 0) in this interval. 

Proof (cf. also p. 48). Suppose that, in Fig. 31, $OA = x_1$, $OB = x_2$; in such 
a case $AM = f(x_1)$, $BN = f(x_2)$. Furthermore, if S is the midpoint of the segment 
AB, then $OS = (x_1 + x_2)/2$ and, consequently, $SP = f[(x_1 + x_2)/2]$. On the 
other hand, since the middle line SQ of the trapezium ABNM is the mean of 
the bases AM and BN, $SQ = [f(x_1) + f(x_2)]/2$. But, by the definition of 
convex functions, the midpoint Q of chord MN lies below the point P of arc 
MN; consequently, 

$$\frac{f(x_1) + f(x_2)}{2} < f\left(\frac{x_1 + x_2}{2}\right),$$

giving the desired proof.† 


Fig. 32 (a)-(e). [Panel (e): y = -x log x - (1 - x) log (1 - x).] 


†Our proof is restricted to the case in which $f(x_1)$ and $f(x_2)$ have the same sign (in fact, 
only this case will be needed later). The reader may consider independently the case in 
which $f(x_1)$ and $f(x_2)$ have different signs (here the property of the middle line of the 
trapezium must be replaced by the following statement: the segment of the middle line of a 
trapezium included between its diagonals is equal to half the difference of the two bases of the 
trapezium). 


Examples‡ 


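(a) y = log x. We have 

$$\frac{\log x_1 + \log x_2}{2} < \log\frac{x_1 + x_2}{2},$$

i.e., 

$$\log\sqrt{x_1x_2} < \log\frac{x_1 + x_2}{2},$$

or, finally, 

$$\sqrt{x_1x_2} < \frac{x_1 + x_2}{2},$$

i.e., the geometric mean of two distinct positive numbers is less than their arithmetic 
mean. 


(b) $y = -x^m$, m > 1. Here we obtain 

$$-\frac{x_1^m + x_2^m}{2} < -\left(\frac{x_1 + x_2}{2}\right)^m,$$

or, in a different form, 

$$\left(\frac{x_1^m + x_2^m}{2}\right)^{1/m} > \frac{x_1 + x_2}{2}.$$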


The expression 

$$\left(\frac{a_1^m + a_2^m + \ldots + a_k^m}{k}\right)^{1/m},$$

the mth root of the arithmetic mean of the mth powers of the numbers $a_1, a_2, \ldots, a_k$, is 
called the exponential mean of order m of these k numbers (in particular, the 
expression 

$$\sqrt{\frac{a_1^2 + a_2^2 + \ldots + a_k^2}{k}},$$

corresponding to the case m = 2, is called the root-mean square of the numbers 
$a_1, a_2, \ldots, a_k$). Thus, the result obtained can be stated as follows: the 
exponential mean of order m > 1 of two distinct positive numbers is always greater 
than their arithmetic mean. 

‡In the contents of this book we have substantially used only inequalities related to the 
convex functions y = -x log x and y = log x (as well as y = -x log x - (1 - x) log (1 - x)). 
Example (b) has here and hereafter only an illustrative value. (The theory of convex functions 
is, in fact, a rich source of curious inequalities, which permits us to multiply arbitrarily the 
number of examples.) 


(c) y = -x log x. From Theorem 1 it follows that 

$$-\frac{x_1\log x_1 + x_2\log x_2}{2} < -\frac{x_1 + x_2}{2}\log\frac{x_1 + x_2}{2},$$

or 

$$x_1\log x_1 + x_2\log x_2 > (x_1 + x_2)\log\frac{x_1 + x_2}{2},$$

a conclusion that we have used twice (see pp. 49 and 65). 
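
The three inequalities of examples (a)-(c) are easy to confirm numerically for a particular pair of numbers. The sketch below is an illustration and not part of the original text; the function names and the sample pair (1, 4) are assumptions made for the example.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """Computed through logarithms, as in example (a)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def power_mean(values, m):
    """Exponential mean of order m: the m-th root of the mean of the m-th powers."""
    return (sum(v**m for v in values) / len(values)) ** (1.0 / m)


pair = (1.0, 4.0)
x1, x2 = pair

# Examples (a) and (b): geometric mean < arithmetic mean < root-mean square.
print(geometric_mean(pair), arithmetic_mean(pair), power_mean(pair, 2))

# Example (c): x1 log x1 + x2 log x2 > (x1 + x2) log((x1 + x2) / 2).
left = x1 * math.log2(x1) + x2 * math.log2(x2)
right = (x1 + x2) * math.log2((x1 + x2) / 2)
print(left, right)
```

A single numerical instance proves nothing, of course; it merely shows the direction of each inequality.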
The inequality of Theorem 1 can be generalized as shown in the next theorem. 


Theorem 2. If the function y = f(x) is convex in the interval from a to b, 
$x_1$ and $x_2$ are two arbitrary numbers in this interval (i.e., $a \le x_1 < x_2 \le b$) and 
p and q are some arbitrary positive numbers whose sum is 1, then 

$$pf(x_1) + qf(x_2) < f(px_1 + qx_2). \qquad (2)$$

For p = q = 1/2, Theorem 2 reduces to Theorem 1. 


Fig. 33. 


Proof. We first note that, if M and N are two points with coordinates $(x_1, y_1)$ 
and $(x_2, y_2)$ and Q is a point of the segment MN dividing this segment in the 
ratio MQ : QN = q : p (where p + q = 1), then the coordinates of the point Q 
are $px_1 + qx_2$ and $py_1 + qy_2$. Indeed, we denote by $X_1$, $X_2$ and X; $Y_1$, $Y_2$ and 
Y the projections of points M, N and Q on the coordinate axes (Fig. 33). The 
points X and Y then divide the segments $X_1X_2$ and $Y_1Y_2$ in the ratio q : p. 
Hence† 

$$OX = OX_1 + X_1X = x_1 + q(x_2 - x_1) = (1 - q)x_1 + qx_2 = px_1 + qx_2$$

and 

$$OY = OY_2 + Y_2Y = y_2 + p(y_1 - y_2) = (1 - p)y_2 + py_1 = py_1 + qy_2.$$

†Figure 33 depicts the case in which all four numbers $x_1$, $x_2$, $y_1$ and $y_2$ are positive 
(essentially this is the only case needed by us). The reader can independently examine other 
cases. 


We now consider again the graph of our convex function y = f(x) (Fig. 34), 

Fig. 34. 

and let $OA = x_1$, $OB = x_2$, $AM = f(x_1)$, $BN = f(x_2)$. By what has been proved 
above, the coordinates of the point Q, dividing the segment MN in the ratio 

MQ : QN = q : p, 

are $px_1 + qx_2$ and $pf(x_1) + qf(x_2)$. Thus, in Fig. 34, $SQ = pf(x_1) + qf(x_2)$ and 
$SP = f(px_1 + qx_2)$ (because $OS = px_1 + qx_2$). But, because of the convexity of 
y = f(x), the point Q is located below the point P; hence, 

$$pf(x_1) + qf(x_2) < f(px_1 + qx_2),$$

giving the desired proof.† 

†It is trivial to see that the coordinates of each point of the segment MN can be represented 
in the form $(px_1 + qx_2,\ py_1 + qy_2)$, with p > 0, q > 0, p + q = 1. Thus, inequality (2) says 
that the entire chord MN lies below the curve y = f(x), i.e., it is equivalent to the definition 
of a convex function. 


Examples 

(a) y = log x. In this case, inequality (2) yields 

$$p\log x_1 + q\log x_2 < \log(px_1 + qx_2).$$

Hence it follows that 

$$x_1^p x_2^q < px_1 + qx_2, \qquad p + q = 1.$$

(b) $y = -x^m$, m > 1. We have 

$$-px_1^m - qx_2^m < -(px_1 + qx_2)^m,$$

or 

$$px_1^m + qx_2^m > (px_1 + qx_2)^m, \qquad p + q = 1.$$

(c) y = -x log x. Here we obtain 

$$-px_1\log x_1 - qx_2\log x_2 < -(px_1 + qx_2)\log(px_1 + qx_2), \qquad p + q = 1.$$
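
Inequality (2) can be probed numerically over a range of weights p. The sketch below is an illustration and not part of the original text; the function name `jensen_gap`, the choice m = 3 and the sample points are assumptions made for the example. For each of the three functions of examples (a)-(c) it computes the difference $f(px_1 + qx_2) - [pf(x_1) + qf(x_2)]$, which Theorem 2 asserts to be positive.

```python
import math


def jensen_gap(f, x1, x2, p):
    """f(p*x1 + q*x2) - (p*f(x1) + q*f(x2)) with q = 1 - p; positive for convex-upward f."""
    q = 1.0 - p
    return f(p * x1 + q * x2) - (p * f(x1) + q * f(x2))


convex_upward = {
    "log x": math.log,
    "-x^3": lambda x: -x**3,
    "-x log x": lambda x: -x * math.log(x),
}

weights = [i / 10 for i in range(1, 10)]   # p = 0.1, 0.2, ..., 0.9
smallest_gap = {
    name: min(jensen_gap(f, 1.0, 4.0, p) for p in weights)
    for name, f in convex_upward.items()
}
print(smallest_gap)

# Example (a) in its final form: x1^p * x2^q < p*x1 + q*x2.
p = 0.25
print(1.0**p * 4.0**(1 - p), p * 1.0 + (1 - p) * 4.0)
```

The gap shrinks as p approaches 0 or 1, where the point Q of the chord approaches one of its end points on the curve.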
Theorem 1 can also be generalized in another direction. 


Theorem 3. If y = f(x) is a convex function in the interval from a to b and 
$x_1, x_2, \ldots, x_k$ are any k values of the argument of the function in this interval, 
none of which is equal to any of the others, then 

$$\frac{f(x_1) + f(x_2) + \ldots + f(x_k)}{k} < f\left(\frac{x_1 + x_2 + \ldots + x_k}{k}\right) \qquad (3)$$

(a particular case of Jensen’s inequality). 
For k = 2, Theorem 3 reduces to Theorem 1. 


Fig. 35. 


Proof. To start with, we define a concept frequently encountered in geometric and analytic problems. Suppose that $M_1M_2M_3 \ldots M_k$ is an arbitrary $k$-gon (Fig. 35a). Let us also assume that $Q_2$ is the midpoint of the side $M_1M_2$ of this $k$-gon ($M_1Q_2 : Q_2M_2 = \tfrac{1}{2} : \tfrac{1}{2}$); $Q_3, Q_4, \ldots, Q_k$ are the points that divide the segments $M_3Q_2, M_4Q_3, \ldots, M_kQ_{k-1}$, respectively, in the ratios $2 : 1$ (i.e., $M_3Q_3 : Q_3Q_2 = \tfrac{2}{3} : \tfrac{1}{3}$), $3 : 1$ (i.e., $M_4Q_4 : Q_4Q_3 = \tfrac{3}{4} : \tfrac{1}{4}$), $\ldots$, $(k - 1) : 1$ (i.e., $M_kQ_k : Q_kQ_{k-1} = \tfrac{k-1}{k} : \tfrac{1}{k}$).

The point $Q_k$ is called the centroid (or the centre of gravity) of the $k$-gon $M_1M_2 \ldots M_k$. In the case of the triangle $M_1M_2M_3$ (Fig. 35b) the centroid $Q_3$ is the point of intersection of its medians: indeed, in this case $Q_2$ is the midpoint of the side $M_1M_2$, the segment $M_3Q_2$ is a median and the point $Q_3$, dividing this segment in the ratio $M_3Q_3 : Q_3Q_2 = 2 : 1$, is the point of intersection of the medians of the triangle.

Let us now show that, if the coordinates of the vertices $M_1, M_2, \ldots, M_k$ of a $k$-gon are $(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$, then the coordinates of the centroid $Q_k$ are $(x_1 + x_2 + \ldots + x_k)/k$ and $(y_1 + y_2 + \ldots + y_k)/k$.† Indeed, by the propositions deduced in the beginning of the proof of Theorem 2, the points $Q_2, Q_3, Q_4, \ldots$, and, finally, $Q_k$ have the following coordinates:


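$$Q_2\left(\frac{x_1 + x_2}{2},\ \frac{y_1 + y_2}{2}\right),$$

$$Q_3\left(\frac{2}{3} \cdot \frac{x_1 + x_2}{2} + \frac{1}{3}x_3,\ \frac{2}{3} \cdot \frac{y_1 + y_2}{2} + \frac{1}{3}y_3\right),$$

or, what is the same,

$$\left(\frac{x_1 + x_2 + x_3}{3},\ \frac{y_1 + y_2 + y_3}{3}\right),$$

$$Q_4\left(\frac{3}{4} \cdot \frac{x_1 + x_2 + x_3}{3} + \frac{1}{4}x_4,\ \frac{3}{4} \cdot \frac{y_1 + y_2 + y_3}{3} + \frac{1}{4}y_4\right),$$

or

$$\left(\frac{x_1 + x_2 + x_3 + x_4}{4},\ \frac{y_1 + y_2 + y_3 + y_4}{4}\right),$$

$$\ldots$$

$$Q_k\left(\frac{k-1}{k} \cdot \frac{x_1 + x_2 + \ldots + x_{k-1}}{k-1} + \frac{1}{k}x_k,\ \frac{k-1}{k} \cdot \frac{y_1 + y_2 + \ldots + y_{k-1}}{k-1} + \frac{1}{k}y_k\right),$$

or

$$\left(\frac{x_1 + x_2 + \ldots + x_k}{k},\ \frac{y_1 + y_2 + \ldots + y_k}{k}\right).$$

†Hence, it follows in particular that the centroid of a $k$-gon is completely determined by this $k$-gon and does not depend on the order of enumeration of its vertices (as might be believed from the definition of a centroid). In the case of a triangle this last circumstance also stems from the fact that the centroid of a triangle is the point of intersection of the medians.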
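The step-by-step construction of the centroid can be replayed in code. The sketch below (function names and the sample pentagon are assumptions chosen for illustration) builds $Q_2, Q_3, \ldots, Q_k$ exactly as in the definition and compares the result with the coordinate averages, using exact fractions to avoid rounding.

```python
from fractions import Fraction

def centroid_by_construction(points):
    """Follow the definition: Q_2 is the midpoint of M_1M_2, and each Q_j
    lies on M_jQ_{j-1} with Q_j = ((j-1)/j) * Q_{j-1} + (1/j) * M_j."""
    qx, qy = points[0]
    for j, (x, y) in enumerate(points[1:], start=2):
        w = Fraction(j - 1, j)          # weight of the previous point Q_{j-1}
        qx = w * qx + (1 - w) * x
        qy = w * qy + (1 - w) * y
    return qx, qy

def centroid_by_average(points):
    """Coordinate-wise arithmetic mean of the vertices."""
    k = Fraction(len(points))
    return sum(x for x, _ in points) / k, sum(y for _, y in points) / k

# Hypothetical pentagon with integer vertices.
vertices = [(0, 0), (4, 0), (5, 3), (1, 6), (-2, 2)]
print(centroid_by_construction(vertices))
print(centroid_by_average(vertices))
```

Running the construction on the reversed vertex list gives the same point, in line with the remark that the centroid does not depend on the order of enumeration.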


Fig. 36.


We now recall our convex function $y = f(x)$. Suppose that $M_1, M_2, \ldots, M_k$ are $k$ successive points of the graph of this function within the considered interval (Fig. 36). Because of the convexity of the function, the $k$-gon $M_1M_2 \ldots M_k$ is convex and lies wholly below the curve $y = f(x)$. If the abscissas of the points $M_1, M_2, \ldots, M_k$ are $x_1, x_2, \ldots, x_k$, then their ordinates are obviously $f(x_1), f(x_2), \ldots, f(x_k)$. Hence, the coordinates of the centroid $Q$ of the $k$-gon $M_1M_2 \ldots M_k$ are given by

$$\frac{x_1 + x_2 + \ldots + x_k}{k} \quad \text{and} \quad \frac{f(x_1) + f(x_2) + \ldots + f(x_k)}{k},$$

and, consequently,

$$OS = \frac{x_1 + x_2 + \ldots + x_k}{k}, \qquad SQ = \frac{f(x_1) + f(x_2) + \ldots + f(x_k)}{k}$$

and

$$SP = f\left(\frac{x_1 + x_2 + \ldots + x_k}{k}\right)$$

(see Fig. 36). However, the centroid of a convex $k$-gon always lies in the interior of the $k$-gon (this is implied by the very definition of a centroid). Consequently, the point $Q$ lies below the point $P$ and, hence,

$$\frac{f(x_1) + f(x_2) + \ldots + f(x_k)}{k} < f\left(\frac{x_1 + x_2 + \ldots + x_k}{k}\right),$$

giving the required proof.
This reasoning remains valid also in the case in which some of the points $M_1, M_2, \ldots, M_k$ (but not all!) coincide (some of the numbers $x_1, x_2, \ldots, x_k$ are equal among themselves). The $k$-gon $M_1M_2 \ldots M_k$ then obviously degenerates into a polygon with a smaller number of vertices.


Examples 


(a) $y = \log x$. From Theorem 3 it follows that

$$\frac{\log x_1 + \log x_2 + \ldots + \log x_k}{k} < \log \frac{x_1 + x_2 + \ldots + x_k}{k},$$

or

$$\sqrt[k]{x_1 x_2 \ldots x_k} < \frac{x_1 + x_2 + \ldots + x_k}{k}.$$

We see that the geometric mean of $k$ positive numbers, at least two of which are distinct, is less than their arithmetic mean (the theorem on geometric and arithmetic means).


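(b) $y = -x^m$, $m > 1$. In such a case we obtain

$$-\frac{x_1^m + x_2^m + \ldots + x_k^m}{k} < -\left(\frac{x_1 + x_2 + \ldots + x_k}{k}\right)^m,$$

or

$$\left(\frac{x_1^m + x_2^m + \ldots + x_k^m}{k}\right)^{1/m} > \frac{x_1 + x_2 + \ldots + x_k}{k}.$$

This shows that the power mean of order $m > 1$ of any $k$ positive numbers, at least two of which are distinct, is greater than their arithmetic mean.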


(c) $y = -x \log x$. In this case, Theorem 3 yields

$$-\frac{x_1 \log x_1 + x_2 \log x_2 + \ldots + x_k \log x_k}{k} < -\frac{x_1 + x_2 + \ldots + x_k}{k} \log \frac{x_1 + x_2 + \ldots + x_k}{k}. \qquad (4)$$
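As a quick numerical illustration of examples (a), (b) and (c), the sketch below compares the geometric, arithmetic and power means of a sample list and evaluates both sides of inequality (4). The function names, the sample numbers and the choice m = 2 are assumptions made for the illustration; natural logarithms are used, which does not affect the direction of the inequalities.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # k-th root of the product, computed through logarithms.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def power_mean(xs, m):
    # Mean of order m: ((x_1^m + ... + x_k^m) / k)^(1/m).
    return (sum(x ** m for x in xs) / len(xs)) ** (1.0 / m)

# Hypothetical sample of k = 4 distinct positive numbers.
xs = [1.0, 2.0, 4.0, 8.0]

gm, am, pm = geometric_mean(xs), arithmetic_mean(xs), power_mean(xs, 2)

# Both sides of inequality (4) for y = -x log x.
lhs4 = -sum(x * math.log(x) for x in xs) / len(xs)
rhs4 = -am * math.log(am)

print(f"geometric {gm:.4f} < arithmetic {am:.4f} < power(2) {pm:.4f}")
print(f"inequality (4): {lhs4:.4f} < {rhs4:.4f}")
```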


Finally, we shall prove one more theorem, which is an extension of both Theorems 2 and 3.

Theorem 4. Suppose that $y = f(x)$ is a convex function in the interval from $a$ to $b$, and $x_1, x_2, \ldots, x_k$ are any $k$ values of the argument of this function, taken in that interval, none of which is equal to any of the others, and $p_1, p_2, \ldots, p_k$ are $k$ positive numbers whose sum is unity. In such a case

$$p_1 f(x_1) + p_2 f(x_2) + \ldots + p_k f(x_k) < f(p_1 x_1 + p_2 x_2 + \ldots + p_k x_k) \qquad (5)$$

(the general Jensen inequality).
For $k = 2$ Theorem 4 reduces to Theorem 2, and for $p_1 = p_2 = \ldots = p_k = 1/k$ to Theorem 3.


Fig. 37.


Proof. Consider again the graph of the convex function $y = f(x)$ and plot on this graph an inscribed convex $k$-gon $M_1M_2 \ldots M_k$, whose vertices have the coordinates $(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$ (see Fig. 37). We suppose now that $Q_2$ is a point of the side $M_1M_2$ of this $k$-gon such that

$$M_1Q_2 : Q_2M_2 = \frac{p_2}{p_1 + p_2} : \frac{p_1}{p_1 + p_2};$$

$Q_3$ is a point of the segment $M_3Q_2$ such that

$$Q_2Q_3 : Q_3M_3 = \frac{p_3}{p_1 + p_2 + p_3} : \frac{p_1 + p_2}{p_1 + p_2 + p_3};$$

$Q_4$ is a point of the segment $M_4Q_3$ such that

$$Q_3Q_4 : Q_4M_4 = \frac{p_4}{p_1 + p_2 + p_3 + p_4} : \frac{p_1 + p_2 + p_3}{p_1 + p_2 + p_3 + p_4};$$

$\ldots$; finally, $Q$ is a point of the segment $M_kQ_{k-1}$ such that $Q_{k-1}Q : QM_k = p_k : (p_1 + p_2 + \ldots + p_{k-1})$. (Obviously, if $p_1 = p_2 = \ldots = p_k = 1/k$, then $Q$ is the centroid of the $k$-gon $M_1M_2 \ldots M_k$.) We make use of the proposition with which we started the proof of Theorem 2 to obtain the coordinates of the points $Q_2, Q_3, Q_4, \ldots, Q$:
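$$Q_2\left(\frac{p_1x_1 + p_2x_2}{p_1 + p_2},\ \frac{p_1f(x_1) + p_2f(x_2)}{p_1 + p_2}\right),$$

$$Q_3\left(\frac{p_1 + p_2}{p_1 + p_2 + p_3} \cdot \frac{p_1x_1 + p_2x_2}{p_1 + p_2} + \frac{p_3}{p_1 + p_2 + p_3}x_3,\ \frac{p_1 + p_2}{p_1 + p_2 + p_3} \cdot \frac{p_1f(x_1) + p_2f(x_2)}{p_1 + p_2} + \frac{p_3}{p_1 + p_2 + p_3}f(x_3)\right),$$

or

$$\left(\frac{p_1x_1 + p_2x_2 + p_3x_3}{p_1 + p_2 + p_3},\ \frac{p_1f(x_1) + p_2f(x_2) + p_3f(x_3)}{p_1 + p_2 + p_3}\right),$$

$$Q_4\left(\frac{p_1 + p_2 + p_3}{p_1 + p_2 + p_3 + p_4} \cdot \frac{p_1x_1 + p_2x_2 + p_3x_3}{p_1 + p_2 + p_3} + \frac{p_4}{p_1 + p_2 + p_3 + p_4}x_4,\ \frac{p_1 + p_2 + p_3}{p_1 + p_2 + p_3 + p_4} \cdot \frac{p_1f(x_1) + p_2f(x_2) + p_3f(x_3)}{p_1 + p_2 + p_3} + \frac{p_4}{p_1 + p_2 + p_3 + p_4}f(x_4)\right),$$

or

$$\left(\frac{p_1x_1 + p_2x_2 + p_3x_3 + p_4x_4}{p_1 + p_2 + p_3 + p_4},\ \frac{p_1f(x_1) + p_2f(x_2) + p_3f(x_3) + p_4f(x_4)}{p_1 + p_2 + p_3 + p_4}\right),$$

$$\ldots$$

$$Q\left(\frac{p_1x_1 + p_2x_2 + \ldots + p_{k-1}x_{k-1} + p_kx_k}{p_1 + p_2 + \ldots + p_{k-1} + p_k},\ \frac{p_1f(x_1) + p_2f(x_2) + \ldots + p_{k-1}f(x_{k-1}) + p_kf(x_k)}{p_1 + p_2 + \ldots + p_{k-1} + p_k}\right),$$

or, differently,

$$\bigl(p_1x_1 + p_2x_2 + \ldots + p_kx_k,\ p_1f(x_1) + p_2f(x_2) + \ldots + p_kf(x_k)\bigr)$$

(since $p_1 + p_2 + \ldots + p_k = 1$).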
Thus, in Fig. 37,

$$SQ = p_1f(x_1) + p_2f(x_2) + \ldots + p_kf(x_k),$$
$$OS = p_1x_1 + p_2x_2 + \ldots + p_kx_k,$$
$$SP = f(p_1x_1 + p_2x_2 + \ldots + p_kx_k).$$

But since the point $Q$ lies below the point $P$ (because the entire $k$-gon $M_1M_2 \ldots M_k$ lies below the curve $y = f(x)$, and $Q$ is an interior point of this $k$-gon), we have

$$p_1f(x_1) + p_2f(x_2) + \ldots + p_kf(x_k) < f(p_1x_1 + p_2x_2 + \ldots + p_kx_k),$$

giving the required proof.
Examples

(a) $y = \log x$. We have

$$p_1 \log x_1 + p_2 \log x_2 + \ldots + p_k \log x_k < \log (p_1x_1 + p_2x_2 + \ldots + p_kx_k),$$

implying that

$$x_1^{p_1} x_2^{p_2} \ldots x_k^{p_k} < p_1x_1 + p_2x_2 + \ldots + p_kx_k,$$

where $p_1 + p_2 + \ldots + p_k = 1$ (the generalized theorem on the geometric and arithmetic means).

(b) $y = -x^m$, $m > 1$. We have

$$-p_1x_1^m - p_2x_2^m - \ldots - p_kx_k^m < -(p_1x_1 + p_2x_2 + \ldots + p_kx_k)^m,$$

or

$$p_1x_1^m + p_2x_2^m + \ldots + p_kx_k^m > (p_1x_1 + p_2x_2 + \ldots + p_kx_k)^m,$$

where $p_1 + p_2 + \ldots + p_k = 1$.

(c) $y = -x \log x$. Theorem 4 yields

$$-p_1x_1 \log x_1 - p_2x_2 \log x_2 - \ldots - p_kx_k \log x_k < -(p_1x_1 + p_2x_2 + \ldots + p_kx_k) \log (p_1x_1 + p_2x_2 + \ldots + p_kx_k), \qquad (6)$$

where $p_1 + p_2 + \ldots + p_k = 1$.
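The general inequality (5) and its consequences in examples (a) and (c) can be evaluated for concrete weights. This is a sketch only: `jensen_sides` is a hypothetical helper, and the sample values of $x_i$ and $p_i$ are arbitrary.

```python
import math

def jensen_sides(f, xs, ps):
    """Return (left, right) of inequality (5):
    left  = p_1 f(x_1) + ... + p_k f(x_k),
    right = f(p_1 x_1 + ... + p_k x_k)."""
    if abs(sum(ps) - 1.0) > 1e-12 or any(p <= 0 for p in ps):
        raise ValueError("weights must be positive and sum to 1")
    left = sum(p * f(x) for p, x in zip(ps, xs))
    right = f(sum(p * x for p, x in zip(ps, xs)))
    return left, right

# Hypothetical data: three distinct arguments and three weights.
xs = [0.2, 1.0, 3.0]
ps = [0.5, 0.3, 0.2]

log_sides = jensen_sides(math.log, xs, ps)                        # example (a)
xlogx_sides = jensen_sides(lambda x: -x * math.log(x), xs, ps)    # example (c)

# Weighted geometric and arithmetic means from example (a).
weighted_gm = math.prod(x ** p for x, p in zip(xs, ps))
weighted_am = sum(p * x for p, x in zip(ps, xs))

print(log_sides, xlogx_sides)
print(weighted_gm, weighted_am)
```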


The derivation of inequalities (4) on p. 355 and (6) is the basic aim of this appendix. From inequality (4) it is immediate that the entropy of an experiment $\alpha$ having $k$ outcomes does not exceed the entropy $\log k$ of an experiment $\alpha_0$ with $k$ outcomes of equal probability; also, $H(\alpha) = \log k$ if and only if all outcomes of $\alpha$ are equally probable, i.e., if $\alpha$ is not different from $\alpha_0$. Actually, we multiply both sides of inequality (4) by $k$ and then substitute in it

$$x_1 = p(A_1),\ x_2 = p(A_2),\ \ldots,\ x_k = p(A_k),$$

where $A_1, A_2, \ldots, A_k$ are the outcomes of $\alpha$ (so that $p(A_1) + p(A_2) + \ldots + p(A_k) = 1$; the probabilities $p(A_1), p(A_2), \ldots, p(A_k)$ are not all equal among themselves). In such a case, we have

$$-p(A_1) \log p(A_1) - p(A_2) \log p(A_2) - \ldots - p(A_k) \log p(A_k) < -[p(A_1) + p(A_2) + \ldots + p(A_k)] \log \frac{p(A_1) + p(A_2) + \ldots + p(A_k)}{k} = -1 \times \log \frac{1}{k} = \log k,$$

or

$$H(\alpha) < H(\alpha_0).$$

†It is trivial to see that the coordinates of every interior point of the $k$-gon $M_1M_2 \ldots M_k$ can be represented in the form

$$\bigl(p_1x_1 + p_2x_2 + \ldots + p_kx_k,\ p_1f(x_1) + p_2f(x_2) + \ldots + p_kf(x_k)\bigr),$$

where $p_1 > 0$, $p_2 > 0$, $\ldots$, $p_k > 0$, and $p_1 + p_2 + \ldots + p_k = 1$. Thus, inequality (5) expresses the situation that a polygon inscribed in the graph of a convex function wholly lies below this graph.
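The bound $H(\alpha) \le \log k$ is easy to observe numerically. In the sketch below the function name `entropy`, the use of base-2 logarithms, the convention $0 \log 0 = 0$ and the sample distributions are assumptions made for the illustration.

```python
import math

def entropy(probs):
    """H = -sum p log2 p, with terms for p = 0 skipped (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical experiments with k = 3 outcomes.
skewed = [0.7, 0.2, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]

h_skewed = entropy(skewed)
h_uniform = entropy(uniform)

print(f"H(skewed)  = {h_skewed:.4f}")
print(f"H(uniform) = {h_uniform:.4f}")
print(f"log2(3)    = {math.log2(3):.4f}")
```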


Inequality (6) can be used to prove that the conditional entropy $H_\alpha(\beta)$ of $\beta$ given $\alpha$ does not exceed the unconditional entropy $H(\beta)$ of $\beta$. In fact, let us put in inequality (6)

$$p_1 = p(A_1),\ p_2 = p(A_2),\ \ldots,\ p_k = p(A_k),$$
$$x_1 = p_{A_1}(B_1),\ x_2 = p_{A_2}(B_1),\ \ldots,\ x_k = p_{A_k}(B_1)$$

(where $A_1, A_2, \ldots, A_k$ and $B_1, B_2, \ldots, B_l$ are the outcomes of $\alpha$ and $\beta$; $p(A_1) + p(A_2) + \ldots + p(A_k) = 1$). Then, we obtain

$$-p(A_1)p_{A_1}(B_1) \log p_{A_1}(B_1) - p(A_2)p_{A_2}(B_1) \log p_{A_2}(B_1) - \ldots - p(A_k)p_{A_k}(B_1) \log p_{A_k}(B_1)$$
$$< -[p(A_1)p_{A_1}(B_1) + p(A_2)p_{A_2}(B_1) + \ldots + p(A_k)p_{A_k}(B_1)] \times \log [p(A_1)p_{A_1}(B_1) + p(A_2)p_{A_2}(B_1) + \ldots + p(A_k)p_{A_k}(B_1)].$$

Since by the equation of total probability (see p. 23)

$$p(A_1)p_{A_1}(B_1) + p(A_2)p_{A_2}(B_1) + \ldots + p(A_k)p_{A_k}(B_1) = p(B_1),$$
the last inequality can be rewritten as

$$-p(A_1)p_{A_1}(B_1) \log p_{A_1}(B_1) - p(A_2)p_{A_2}(B_1) \log p_{A_2}(B_1) - \ldots - p(A_k)p_{A_k}(B_1) \log p_{A_k}(B_1) < -p(B_1) \log p(B_1).$$

We note that, if $p_{A_1}(B_1) = p_{A_2}(B_1) = \ldots = p_{A_k}(B_1) = p(B_1)$ (the last equality here follows from the equation of total probability), then our inequality turns into an equality, so that in general the sign $<$ must be replaced by $\le$. In exactly the same way, we obtain

$$-p(A_1)p_{A_1}(B_2) \log p_{A_1}(B_2) - p(A_2)p_{A_2}(B_2) \log p_{A_2}(B_2) - \ldots - p(A_k)p_{A_k}(B_2) \log p_{A_k}(B_2) \le -p(B_2) \log p(B_2),$$

$$\ldots$$

$$-p(A_1)p_{A_1}(B_l) \log p_{A_1}(B_l) - p(A_2)p_{A_2}(B_l) \log p_{A_2}(B_l) - \ldots - p(A_k)p_{A_k}(B_l) \log p_{A_k}(B_l) \le -p(B_l) \log p(B_l).$$

We now add all these inequalities to obtain

$$p(A_1)H_{A_1}(\beta) + p(A_2)H_{A_2}(\beta) + \ldots + p(A_k)H_{A_k}(\beta) \le H(\beta),$$

or

$$H_\alpha(\beta) \le H(\beta).$$

The inequality is strict, $H_\alpha(\beta) < H(\beta)$, if experiments $\alpha$ and $\beta$ are not independent, i.e., if there exist $i$ and $j$ ($1 \le i \le k$, $1 \le j \le l$) such that $p_{A_i}(B_j) \ne p(B_j)$. If, however, experiments $\alpha$ and $\beta$ are independent, then, obviously, $H_\alpha(\beta) = H(\beta)$.
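The relation between conditional and unconditional entropy can be checked on small tables of conditional probabilities. The sketch assumes base-2 logarithms; the function names and the two sample experiments (one dependent pair, one independent pair) are hypothetical.

```python
import math

def entropy(probs):
    """H = -sum p log2 p, skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(p_a, p_b_given_a):
    """H_alpha(beta) = p(A_1) H_{A_1}(beta) + ... + p(A_k) H_{A_k}(beta).

    p_b_given_a[i][j] stands for the conditional probability p_{A_i}(B_j)."""
    return sum(pa * entropy(row) for pa, row in zip(p_a, p_b_given_a))

def marginal_beta(p_a, p_b_given_a):
    """Equation of total probability: p(B_j) = sum_i p(A_i) p_{A_i}(B_j)."""
    n_outcomes = len(p_b_given_a[0])
    return [sum(pa * row[j] for pa, row in zip(p_a, p_b_given_a))
            for j in range(n_outcomes)]

# Hypothetical experiment alpha with two equally probable outcomes.
p_a = [0.5, 0.5]
dependent = [[0.9, 0.1], [0.2, 0.8]]     # beta depends on alpha
independent = [[0.3, 0.7], [0.3, 0.7]]   # beta independent of alpha

for name, table in (("dependent", dependent), ("independent", independent)):
    h_cond = conditional_entropy(p_a, table)
    h_beta = entropy(marginal_beta(p_a, table))
    print(f"{name}: H_alpha(beta) = {h_cond:.4f}, H(beta) = {h_beta:.4f}")
```

In the dependent case the conditional entropy comes out strictly smaller; in the independent case the two values coincide.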


We further note that, if we substitute $x_1 = q_1/p_1$, $x_2 = q_2/p_2$, $\ldots$, $x_k = q_k/p_k$, where $q_1 + q_2 + \ldots + q_k \le 1$, in the inequality

$$p_1 \log x_1 + p_2 \log x_2 + \ldots + p_k \log x_k \le \log (p_1x_1 + p_2x_2 + \ldots + p_kx_k), \qquad p_1 + p_2 + \ldots + p_k = 1$$

(see Ex. (a) on p. 358), then we obtain

$$p_1 \log \frac{q_1}{p_1} + p_2 \log \frac{q_2}{p_2} + \ldots + p_k \log \frac{q_k}{p_k} \le \log (q_1 + q_2 + \ldots + q_k) \le \log 1 = 0.$$

But

$$\log \frac{q_1}{p_1} = \log q_1 - \log p_1,\quad \log \frac{q_2}{p_2} = \log q_2 - \log p_2,\quad \ldots,\quad \log \frac{q_k}{p_k} = \log q_k - \log p_k,$$

hence

$$-p_1 \log p_1 - p_2 \log p_2 - \ldots - p_k \log p_k \le -p_1 \log q_1 - p_2 \log q_2 - \ldots - p_k \log q_k,$$

i.e., we arrive at inequality (*) on p. 134.
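The last inequality can be observed for a concrete pair of number sets. In this sketch the name `cross_term` for the right-hand side is my own label, base-2 logarithms are assumed, and the values of $p_i$ and $q_i$ are arbitrary (with $q_1 + q_2 + q_3 \le 1$).

```python
import math

def entropy(ps):
    """-sum p_i log2 p_i."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def cross_term(ps, qs):
    """-sum p_i log2 q_i, the right-hand side of the inequality."""
    return -sum(p * math.log2(q) for p, q in zip(ps, qs) if p > 0)

# Hypothetical data: the p_i sum to 1, the q_i sum to 0.9 <= 1.
ps = [0.5, 0.25, 0.25]
qs = [0.2, 0.3, 0.4]

left = entropy(ps)
right = cross_term(ps, qs)
print(f"{left:.4f} <= {right:.4f}")
```

Taking $q_i = p_i$ for every $i$ turns the inequality into an equality.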
Finally, let us consider the inequality

$$H_{\alpha\gamma}(\beta) \le H_\gamma(\beta),$$

which generalizes the inequality $H_\alpha(\beta) \le H(\beta)$, and has been mentioned at the end of Sec. 3 of Chap. 2. (This inequality turns into $H_\alpha(\beta) \le H(\beta)$ if it is assumed that experiment $\gamma$ has a single outcome realized with probability 1.) It is easy to deduce the considered inequality from the inequality $H_\alpha(\beta) \le H(\beta)$. Indeed, we denote by $C_1, C_2, \ldots, C_m$ the outcomes of the experiment $\gamma$; suppose $\alpha^{(1)}$ and $\beta^{(1)}$ to be experiments with outcomes

$$A_1^{(1)}, A_2^{(1)}, \ldots, A_k^{(1)} \quad \text{and} \quad B_1^{(1)}, B_2^{(1)}, \ldots, B_l^{(1)}$$

having probabilities

$$p(A_1^{(1)}) = p_{C_1}(A_1),\ p(A_2^{(1)}) = p_{C_1}(A_2),\ \ldots,\ p(A_k^{(1)}) = p_{C_1}(A_k),$$

and

$$p(B_1^{(1)}) = p_{C_1}(B_1),\ p(B_2^{(1)}) = p_{C_1}(B_2),\ \ldots,\ p(B_l^{(1)}) = p_{C_1}(B_l),$$

respectively. By what has been proved above, we have

$$H_{\alpha^{(1)}}(\beta^{(1)}) \le H(\beta^{(1)}).$$

But

$$H(\beta^{(1)}) = -p(B_1^{(1)}) \log p(B_1^{(1)}) - p(B_2^{(1)}) \log p(B_2^{(1)}) - \ldots - p(B_l^{(1)}) \log p(B_l^{(1)})$$
$$= -p_{C_1}(B_1) \log p_{C_1}(B_1) - p_{C_1}(B_2) \log p_{C_1}(B_2) - \ldots - p_{C_1}(B_l) \log p_{C_1}(B_l) = H_{C_1}(\beta)$$

and

$$H_{\alpha^{(1)}}(\beta^{(1)}) = p(A_1^{(1)})H_{A_1^{(1)}}(\beta^{(1)}) + p(A_2^{(1)})H_{A_2^{(1)}}(\beta^{(1)}) + \ldots + p(A_k^{(1)})H_{A_k^{(1)}}(\beta^{(1)}),$$

where, for $i = 1, 2, \ldots, k$,

$$H_{A_i^{(1)}}(\beta^{(1)}) = -p_{A_i^{(1)}}(B_1^{(1)}) \log p_{A_i^{(1)}}(B_1^{(1)}) - p_{A_i^{(1)}}(B_2^{(1)}) \log p_{A_i^{(1)}}(B_2^{(1)}) - \ldots - p_{A_i^{(1)}}(B_l^{(1)}) \log p_{A_i^{(1)}}(B_l^{(1)}).$$

It is now required to find the conditional probabilities

$$p_{A_1^{(1)}}(B_1^{(1)}),\ p_{A_1^{(1)}}(B_2^{(1)}),\ \ldots$$

By the multiplication law of probabilities (see Sec. 3 of Chap. 1, p. 22), $p_{A_1^{(1)}}(B_1^{(1)})$ is equal to the ratio of the probabilities of the events $A_1^{(1)}B_1^{(1)}$ and $A_1^{(1)}$. But $p(A_1^{(1)}) = p_{C_1}(A_1)$; as to the probability of the event $A_1^{(1)}B_1^{(1)}$, it is obviously equal to the conditional probability $p_{C_1}(A_1B_1)$ (since $A_1^{(1)}$ is the occurrence of event $A_1$ under the condition that event $C_1$ occurred and $B_1^{(1)}$ is the occurrence of event $B_1$ under the same condition, $A_1^{(1)}B_1^{(1)}$ is the occurrence of $A_1B_1$ under the same condition). But by the multiplication law of probabilities $p_{C_1}(A_1B_1) = p_{C_1}(A_1)p_{C_1A_1}(B_1)$; consequently,

$$p_{A_1^{(1)}}(B_1^{(1)}) = \frac{p(A_1^{(1)}B_1^{(1)})}{p(A_1^{(1)})} = \frac{p_{C_1}(A_1)p_{C_1A_1}(B_1)}{p_{C_1}(A_1)} = p_{C_1A_1}(B_1).$$

In exactly the same way, it is shown that

$$p_{A_1^{(1)}}(B_2^{(1)}) = p_{C_1A_1}(B_2),\ \ldots,\ p_{A_1^{(1)}}(B_l^{(1)}) = p_{C_1A_1}(B_l),\ \ldots,\ p_{A_k^{(1)}}(B_l^{(1)}) = p_{C_1A_k}(B_l).$$

Hence, we obtain

$$H_{A_1^{(1)}}(\beta^{(1)}) = -p_{C_1A_1}(B_1) \log p_{C_1A_1}(B_1) - p_{C_1A_1}(B_2) \log p_{C_1A_1}(B_2) - \ldots - p_{C_1A_1}(B_l) \log p_{C_1A_1}(B_l) = H_{C_1A_1}(\beta),$$

and similarly,

$$H_{A_2^{(1)}}(\beta^{(1)}) = H_{C_1A_2}(\beta),\ H_{A_3^{(1)}}(\beta^{(1)}) = H_{C_1A_3}(\beta),\ \ldots,\ H_{A_k^{(1)}}(\beta^{(1)}) = H_{C_1A_k}(\beta).$$

Thus, recalling that

$$p(A_1^{(1)}) = p_{C_1}(A_1),\ p(A_2^{(1)}) = p_{C_1}(A_2),\ \ldots,\ p(A_k^{(1)}) = p_{C_1}(A_k),$$

we have

$$H_{\alpha^{(1)}}(\beta^{(1)}) = p_{C_1}(A_1)H_{C_1A_1}(\beta) + p_{C_1}(A_2)H_{C_1A_2}(\beta) + \ldots + p_{C_1}(A_k)H_{C_1A_k}(\beta).$$

We see that the inequality $H_{\alpha^{(1)}}(\beta^{(1)}) \le H(\beta^{(1)})$ can be written in the form

$$p_{C_1}(A_1)H_{C_1A_1}(\beta) + p_{C_1}(A_2)H_{C_1A_2}(\beta) + \ldots + p_{C_1}(A_k)H_{C_1A_k}(\beta) \le H_{C_1}(\beta).$$

Multiplying this by $p(C_1)$ and noting that

$$p(C_1)p_{C_1}(A_1) = p(C_1A_1),\ p(C_1)p_{C_1}(A_2) = p(C_1A_2),\ \ldots,\ p(C_1)p_{C_1}(A_k) = p(C_1A_k),$$

we have

$$p(C_1A_1)H_{C_1A_1}(\beta) + p(C_1A_2)H_{C_1A_2}(\beta) + \ldots + p(C_1A_k)H_{C_1A_k}(\beta) \le p(C_1)H_{C_1}(\beta).$$

In just the same way we establish the inequalities

$$p(C_2A_1)H_{C_2A_1}(\beta) + p(C_2A_2)H_{C_2A_2}(\beta) + \ldots + p(C_2A_k)H_{C_2A_k}(\beta) \le p(C_2)H_{C_2}(\beta),$$

$$\ldots$$

$$p(C_mA_1)H_{C_mA_1}(\beta) + p(C_mA_2)H_{C_mA_2}(\beta) + \ldots + p(C_mA_k)H_{C_mA_k}(\beta) \le p(C_m)H_{C_m}(\beta).$$

Adding termwise all these inequalities, we obtain

$$H_{\gamma\alpha}(\beta) \le H_\gamma(\beta),$$

giving the required proof (the events $\gamma\alpha$ and $\alpha\gamma$ are not distinct).
Some algebraic concepts


The main objects of study in algebra are algebraic systems, i.e., sets of elements on which certain algebraic operations are defined, similar to the well-known arithmetic operations of addition and multiplication of numbers. Moreover, the nature of the elements of such a system and the concrete meaning of the operations under consideration are usually not specified, so that one and the same algebraic scheme can describe many diverse examples. By contrast, the properties of the algebraic operations are described explicitly, and this description forms the definition of the corresponding system.

1. The first algebraic concept, extensively used in many branches of mathematics, is the concept of a (commutative) group.

A set $G$ of elements $a, b, c, \ldots$ is called a (commutative) group if on this set an operation $\circ$ is defined, assigning to each pair of elements $a$ and $b$ of our set a unique third element denoted by the symbol $a \circ b$, and for which the following properties hold:

G1 : The operation $\circ$ is commutative:†
$a \circ b = b \circ a$ for every $a$ and $b$ in $G$;

G2 : The operation $\circ$ is associative:
$(a \circ b) \circ c = a \circ (b \circ c)$ for every $a$, $b$ and $c$ in $G$;

G3 : In the set $G$ there is an identity element $e$ such that
$a \circ e = a$ for every $a$ in $G$;

G4 : For each $a$ in $G$ there is a symmetric element $a^*$ such that
$a \circ a^* = e$.

†In algebra non-commutative groups are also often considered, for which the property G1 does not hold. However, since only commutative groups are encountered in this book, we have agreed, in departure from the convention, to include G1 in the definition of a group.

The group operation $\circ$ is sometimes denoted by the symbol $+$ (additive group notation). In such a case, the element $a + b$ is called the sum of elements $a$ and $b$; an identity element $e$ such that

$a + e = a$ for every $a$

is called the null or zero element or simply the null or zero of the group and is usually denoted by 0; a symmetric element $a^*$ such that

$a + a^* = 0$

is called the negative of $a$ and is written $-a$. The result $a \circ b$ of applying the group operation to the elements $a$ and $b$ may also be written as $a \times b$ or $ab$ (multiplicative group notation). In such a case, $ae = a$ for every $a$, and hence $e$ is called the unit element or the unit of the group and is sometimes denoted by 1; furthermore, $aa^* = 1$ and hence $a^*$ is called here the inverse of $a$ and written $a^{-1}$. We shall hereafter always denote a group operation by the symbol $+$; accordingly, we denote by $a - b$ an element $x$ (the difference of elements $a$ and $b$) such that $x + b = a$ (it is trivial to see that such an element $x$ always exists: it is equal to $a + (-b)$).


Examples 


A. The set of integers (or rational numbers, or real numbers) forms a group with respect to addition. In other words, the corresponding set, where (ordinary) addition is taken as the group operation, forms a group with the null element 0 and the negative $-a$ of $a$.


B. We agree to take the multiplication of numbers as a group operation (which we now denote by the symbol $\langle + \rangle$, in order to emphasize that this is not ordinary addition). Under this operation the set of integers does not form a group, because here G4 is obviously not satisfied: in fact, an integer $a^*$ such that $a \langle + \rangle a^* = aa^* = 1$ exists if and only if $a = 1$ or $a = -1$. Similarly, the set of all rational numbers does not form a group with respect to multiplication, because here G4 is violated for $a = 0$. However, the set of all nonzero (or positive) rational numbers (or real numbers) forms a multiplicative group.


C. We consider again the set of integers with the operation of addition of numbers. We now choose any positive integer $q$ and agree to replace every integer $A$ by the remainder after the division of $A$ by $q$. Thus, say, if $q = 10$, then we agree to keep from each positive integer $A$ only its last digit $a$ (this is also the remainder obtained from the division of $A$ by 10). The set of all possible remainders obtained from the division of all integers by $q$ (formed of the $q$ numbers $0, 1, 2, \ldots, q - 1$) is called a $q$-arithmetic; the sum of the elements $a$ and $b$ of a $q$-arithmetic is the remainder obtained from the division of the usual sum $a + b$ by $q$ (it equals $a + b$ if $a + b < q$). Tables of addition in 2-arithmetic, 5-arithmetic and 6-arithmetic look like the accompanying Tables 1, 2 and 3.
TABLE 1

 + | 0 1
 0 | 0 1
 1 | 1 0

TABLE 2

 + | 0 1 2 3 4
 0 | 0 1 2 3 4
 1 | 1 2 3 4 0
 2 | 2 3 4 0 1
 3 | 3 4 0 1 2
 4 | 4 0 1 2 3

TABLE 3

 + | 0 1 2 3 4 5
 0 | 0 1 2 3 4 5
 1 | 1 2 3 4 5 0
 2 | 2 3 4 5 0 1
 3 | 3 4 5 0 1 2
 4 | 4 5 0 1 2 3
 5 | 5 0 1 2 3 4


It is easy to see that a $q$-arithmetic with respect to addition is itself a group of $q$ elements (or, as is said, a group of order $q$). The null element of this group is 0 and the negative of $a \ne 0$ is the number $q - a$ (because the sum $a + (q - a)$ when divided by $q$ gives the remainder 0). For 2-arithmetic, the negative of every number $a$ (i.e., both for $a = 0$ and $a = 1$) is obviously $a$ itself: here $-a = a$ always.
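The addition tables and the group properties G1 to G4 of a $q$-arithmetic can be verified by brute force. The function names below are hypothetical; the sketch simply takes addition in a $q$-arithmetic to be `(a + b) % q`.

```python
def addition_table(q):
    """Rows of the addition table of q-arithmetic: entry [a][b] = (a + b) mod q."""
    return [[(a + b) % q for b in range(q)] for a in range(q)]

def is_commutative_group(q):
    """Check properties G1-G4 for addition in q-arithmetic by exhaustive search."""
    elems = range(q)

    def add(a, b):
        return (a + b) % q

    g1 = all(add(a, b) == add(b, a) for a in elems for b in elems)
    g2 = all(add(add(a, b), c) == add(a, add(b, c))
             for a in elems for b in elems for c in elems)
    g3 = all(add(a, 0) == a for a in elems)            # 0 is the null element
    g4 = all(add(a, (q - a) % q) == 0 for a in elems)  # q - a is the negative of a
    return g1 and g2 and g3 and g4

for q in (2, 5, 6):
    print(f"q = {q}: group = {is_commutative_group(q)}")
    for row in addition_table(q):
        print(" ", *row)
```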


D. Suppose that $G$ is an arbitrary group, say, the group of integers with respect to addition or the additive group of numbers in a $q$-arithmetic. We now consider an arbitrary rectangular array of $m$ rows and $n$ columns, or an $(m \times n)$-matrix

$$A = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{pmatrix},$$

composed of the elements of $G$, which we shall hereafter call numbers. It is clear that if we agree to add the matrices elementwise (i.e., consider that a number appearing at some place in the matrix-sum is equal to the sum of the numbers occurring at the same places in the matrix-summands), then we arrive at an additive group of $(m \times n)$-matrices; the null element of this group is the zero matrix $O$, all of whose elements are zeros.

$(1 \times n)$-matrices are also called vectors (or row vectors); similarly, $(m \times 1)$-matrices are called column vectors. Obviously, vectors with a fixed number of elements in a row (or column) also admit addition with each other; if the elements of a vector belong to some group ('group of numbers'), then the set of all vectors also forms a group with respect to addition. Vectors are mostly denoted by small bold-face Latin letters; the 'null vector' (i.e., a row or a column composed of 0s) is sometimes denoted by a bold-face 0.

If a group $G$ of 'numbers' is infinite, then a corresponding group of $(m \times n)$-matrices (in particular, of vectors) is also infinite. If, however, $G$ is of finite order $q$, then a group of $(m \times n)$-matrices is of order $q^{mn}$; in fact, a matrix has $mn$ elements, in place of each of which we can substitute any of the $q$ elements of $G$. Similarly, a group of row vectors of $n$ elements and a group of column vectors of $m$ elements are, respectively, of finite order $q^n$ and $q^m$ if $G$ is of order $q$.


E. Consider an arbitrary polynomial

$$f(x) = a_0 + a_1x + a_2x^2 + \ldots + a_{n-1}x^{n-1},$$

whose coefficients $a_0, a_1, \ldots, a_{n-1}$ are the elements of an arbitrarily chosen group $G$. If $g(x)$ is another polynomial

$$g(x) = b_0 + b_1x + b_2x^2 + \ldots + b_{n-1}x^{n-1}$$

(we assume that $f(x)$ and $g(x)$ are of the same degree because otherwise there can always be added to the one of lower degree some 'leading' terms with the coefficient 0), then the sum of the polynomials can be defined by

$$f(x) + g(x) = (a_0 + b_0) + (a_1 + b_1)x + (a_2 + b_2)x^2 + \ldots + (a_{n-1} + b_{n-1})x^{n-1}.$$

It is easy to see that the polynomials with addition so defined form a group. Obviously, this group is always infinite, because the degree of the polynomials can be arbitrarily large. The role of the null element of this group is played by the 'null' polynomial 0, all of whose coefficients are 0; the negative of $f(x)$ is the polynomial $-f(x)$, all of whose coefficients are the negatives of the coefficients of $f(x)$.

If we confine ourselves to polynomials of degree less than $n$, where $n$ is some fixed number, then we also obtain a group; as is easy to see, it differs from the group of vectors

$$\mathbf{f} = (a_0, a_1, a_2, \ldots, a_{n-1})$$

only in the form of writing the elements of the group. This group is finite if the group $G$ is finite; if $G$ is of order $q$, then the order of the group of polynomials of degree $< n$ is $q^n$. Thus, say, there are in all $2^2 = 4$ polynomials of degree $< 2$ with coefficients from a 2-arithmetic: $0$, $1$, $x$ and $x + 1$; a 'table of addition' of these polynomials looks like the accompanying Table 4.


TABLE 4

  +  | 0    1    x    x+1
  0  | 0    1    x    x+1
  1  | 1    0    x+1  x
  x  | x    x+1  0    1
 x+1 | x+1  x    1    0
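Table 4 can be regenerated by treating each polynomial of degree less than 2 as a vector of coefficients $(a_0, a_1)$ and adding coefficientwise in 2-arithmetic. The names `add_poly`, `show` and `polys` below are illustrative assumptions; `show` is written only for this two-coefficient, 0/1 case.

```python
from itertools import product

Q = 2          # coefficients are taken from 2-arithmetic
N_COEFFS = 2   # polynomials of degree < 2, i.e. a0 + a1*x

def add_poly(f, g, q=Q):
    """Add two polynomials given as coefficient tuples (a0, a1, ...)."""
    return tuple((a + b) % q for a, b in zip(f, g))

def show(f):
    """Render (a0, a1) with 0/1 coefficients, e.g. (1, 1) -> 'x+1'."""
    terms = [t for c, t in zip(reversed(f), ("x", "1")) if c]
    return "+".join(terms) if terms else "0"

# All q^n = 4 polynomials, ordered as 0, 1, x, x+1.
polys = sorted(product(range(Q), repeat=N_COEFFS), key=lambda f: f[::-1])

table = {(show(f), show(g)): show(add_poly(f, g)) for f in polys for g in polys}

for f in polys:
    print(show(f).ljust(4), *(table[(show(f), show(g))].ljust(4) for g in polys))
```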
Suppose now that $G$ is an arbitrary group and that $H$ is a subset of elements of $G$. If the set $H$ is such that


SG1 : if $a$, $b$ belong to $H$, then $a + b$ also belongs to $H$;
SG2 : if $a$ belongs to $H$, then $-a$ also belongs to $H$;
SG3 : the null element 0 of $G$ belongs to $H$,


then $H$ itself forms a group with respect to the addition defined on $G$. In such a case, we say that $H$ is a subgroup of $G$.

It is easy to see that a subgroup can also be defined as a nonempty set $H$ of elements of a group satisfying the single requirement: if $a$ and $b$ belong to $H$, then $a - b$ also belongs to $H$. In fact, then, evidently, 0 belongs to $H$, since $0 = a - a$, where $a$ is any element of $H$. Moreover, if $a$ belongs to $H$, then $-a$ also belongs to $H$, since $-a = 0 - a$; also if $a$ and $b$ belong to $H$, then $a + b = a - (-b)$ belongs to $H$.

In particular, if $G$ is the additive group of integers, then the collection $H$ of all integers that are multiples of a fixed integer $l$ forms a subgroup of $G$. In exactly the same way, if $G$ is the additive group of numbers in a $q$-arithmetic and $q = kl$ is a composite number, then the collection $H$ of all numbers belonging to $G$ that are divisible by $l$ (i.e., the numbers $0, l, 2l, 3l, \ldots, (k - 1)l$) forms a subgroup of $G$ (which differs immaterially, as is easy to understand, from the additive group of numbers in a $k$-arithmetic).

A subgroup of an additive group of $(m \times n)$-matrices is, for example, the group of all matrices, all of whose rows, except the first one, contain only zeros (this subgroup is obviously equivalent to, and only written differently from, an additive group of row vectors), and also the group of matrices, all of whose elements are 0 except a fixed one, say, the element $a_{11}$ appearing in the top left corner (this subgroup reduces to the group $G$, because each of its elements is given by the single number $a_{11}$). Furthermore, if $G$ is a group of $(m \times n)$-matrices $A$ with elements from a 2-arithmetic, then in order to ensure that some nonempty subset of it is a subgroup it suffices only to verify that SG1 is satisfied (because in a 2-arithmetic every number is the negative of itself and hence here $A + A = O$ for each matrix $A$ and, consequently, $-A = A$).

A subgroup of the group of all polynomials is the group of polynomials of degree $< n$. Within the latter group, in turn, the set of all polynomials of degree $< k$, where $k < n$, and the set of all polynomials that vanish for $x = 0$ (i.e., polynomials that have zero 'free term' $a_0$) both form subgroups.

If $H$ is a subgroup of $G$, then the set of all elements of the form $a + h$, where $a$ is a fixed element of $G$ and $h$ runs through all elements of $H$, is called a coset of $H$ in $G$ and is written $a + H$. It is easy to show that any two cosets of $H$ in $G$ are either disjoint (i.e., they do not contain any common element), or are exactly the same. In fact, if the cosets $a + H$ and $b + H$ have a common element, then $a + h_1 = b + h_2$, where both $h_1$ and $h_2$ belong to $H$, and hence $a - b = h_2 - h_1$, i.e., $a - b = h$ also belongs to $H$. Therefore, the coset $a + H$ can be represented as $b + h + H = b + (h + H)$. But if $h$ belongs to $H$, then $h + H = H$, since any element $h_1$ of $H$ can be represented as an element $h + (h_1 - h)$ of $h + H$, and any element $h + h_1$ of $h + H$ belongs to $H$. This completes the proof of the italicized assertion.

We see that a subgroup H determines the partition of a group G into disjoint 
cosets of H. If H contains a finite number n of elements, then any coset will 
contain n elements, too. Let us now assume that G is of finite order N (i.e., 
it contains N elements). Since all these elements must form a finite number of 
cosets, we obtain the following Lagrange’s theorem. 


LAGRANGE'S THEOREM. If $G$ is a finite group of order $N$ and $H$ a subgroup of $G$ of order $n$, then $N = nk$, where $k$ is an integer, i.e., $n$ is a divisor of $N$. The integer $k$ is, of course, equal to the number of cosets of $H$ in $G$ and is called the index of $H$ in $G$.
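The partition into cosets and the relation $N = nk$ can be seen directly in a $q$-arithmetic. The sketch takes $G$ to be the additive group of a 12-arithmetic and $H$ the multiples of 3; these values and the function name are assumptions for the illustration.

```python
def cosets_of_multiples(q, l):
    """Cosets of H = {0, l, 2l, ...} in the additive group of q-arithmetic.

    Assumes l divides q, so that H is the subgroup described in the text."""
    if q % l != 0:
        raise ValueError("l must divide q")
    H = frozenset(range(0, q, l))
    # a + H for every a in G; equal cosets collapse in the set.
    cosets = {frozenset((a + h) % q for h in H) for a in range(q)}
    return H, cosets

H, cosets = cosets_of_multiples(12, 3)
N, n, k = 12, len(H), len(cosets)

print("H =", sorted(H))
print("cosets =", sorted(sorted(c) for c in cosets))
print(f"N = {N}, n = {n}, k = {k}, N == n*k: {N == n * k}")
```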

Let $G$ be a finite group and $a$ an arbitrary element of $G$. Consider the sequence of sums

$$a = 0 + a,\ a + a = 2a,\ 2a + a = 3a,\ 3a + a = 4a,\ \ldots$$

All these sums cannot be distinct since the number of distinct elements of $G$ is finite. Moreover, if $ia = ja$, where (say) $j > i$, then $ja - ia = (j - i)a = 0$. The smallest positive integer $n$ satisfying the relation $na = 0$ is called the order of the element $a$. It is easy to see that the elements $1a = a, 2a, 3a, \ldots, (n - 1)a, na = 0$ form a subgroup of $G$, which we call the cyclic subgroup of $G$ generated by $a$. (If the group $G$ coincides with one of its cyclic subgroups, then $G$ is called a cyclic group.) The order of the cyclic subgroup clearly coincides with the order of $a$. Therefore, Lagrange's theorem implies that the order of any element of a finite group $G$ is a divisor of the order of $G$. It is also clear that if $n$ is the order of $a$ and $ma = 0$ for an integer $m$, then $m$ must be a multiple of $n$. In fact, if $m = kn + r$, where $r < n$ is the remainder of the division of $m$ by $n$, then $ma = (kn + r)a = k(na) + ra = 0 + ra = ra$, i.e., $ra = 0$. However, this implies that $r = 0$ (since $r < n$ and $n$ is the smallest positive integer satisfying the equation $na = 0$).

Let us also note that, if we use the multiplicative group notation, then the order $n$ of an element $a$ must be defined as the smallest positive integer $n$ satisfying the relation $a^n = 1$. The cyclic subgroup generated by $a$ consists in this case of the elements $a^1 = a,\ a^2 = a \times a,\ a^3, \ldots, a^{n-1},\ a^n = 1\ (= a^0)$. Moreover, the equation $a^m = 1$ is valid here if and only if $m$ is a multiple of $n$.
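Orders of elements, in both the additive and the multiplicative notation, can be computed by repeating the group operation until the identity is reached. The sketch below uses a 12-arithmetic for the additive case and the nonzero numbers of a 7-arithmetic for the multiplicative case; both choices and the function names are assumptions.

```python
def additive_order(a, q):
    """Smallest positive n with n*a = 0 in q-arithmetic."""
    n, s = 1, a % q
    while s != 0:
        s = (s + a) % q
        n += 1
    return n

def multiplicative_order(a, p):
    """Smallest positive n with a^n = 1 in p-arithmetic.

    Assumes p is prime and a is nonzero, so that the loop terminates."""
    if a % p == 0:
        raise ValueError("a must be nonzero")
    n, s = 1, a % p
    while s != 1:
        s = (s * a) % p
        n += 1
    return n

# Additive orders in 12-arithmetic all divide N = 12.
print({a: additive_order(a, 12) for a in range(12)})

# Multiplicative orders in 7-arithmetic all divide N - 1 = 6.
print({a: multiplicative_order(a, 7) for a in range(1, 7)})
```

For instance, the element 8 of a 12-arithmetic generates the cyclic subgroup {8, 4, 0}, whose order 3 divides 12, in agreement with Lagrange's theorem.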


2. The other important algebraic systems are fields and rings.


A field is a set $F$ of elements $a, b, c, \ldots$ for which two operations are defined, each associating with a pair of elements $a$ and $b$ of $F$ a third element. These operations are called 'addition' (the 'sum' of elements $a$ and $b$ of $F$ is written $a + b$) and 'multiplication' (the product of elements $a$ and $b$ is naturally denoted by $ab$). In addition,


F1 : the elements of a field must form a group with respect to addition;


F2 : the nonzero elements of a field must form a group with respect to multiplication;


F3 : the addition and multiplication must obey the distributive law
$(a + b)c = ac + bc$ for all $a$, $b$ and $c$.


It is easy to understand that for any elements $a$ and $b$ of a field $F$, where $b$ is different from the null element 0, there exists their 'quotient' $a/b$, i.e., an element $y$ such that $by = a$; this $y$ can be defined by the formula $y = ab^{-1}$. Moreover, it is clear that if 0 is the null element of a field (i.e., the identity element of the corresponding additive group), then $a0 = 0$ for every $a$ (since $0 = 1 - 1 = 1 + (-1)$, and, therefore, $a0 = a \times [1 + (-1)] = a1 + a(-1) = a - a = 0$). It is also important to note that if $ab = 0$, then at least one of the elements $a$ and $b$ is necessarily equal to 0. In fact, if (say) $b \ne 0$, then the multiplication of the equality $ab = 0$ by $b^{-1}$ yields $abb^{-1} = 0b^{-1}$, i.e., $a1 = 0$ or $a = 0$.


Examples 


A. It is obvious that a set of all rational (or real, or complex) numbers forms 
a field with respect to the operations of ordinary addition and multiplication. 


B. The product of numbers $a$ and $b$ of a $q$-arithmetic is defined as the remainder after division of the ordinary product $ab$ by $q$; thus, say, the product of numbers $a$ and $b$ of a 10-arithmetic is just the last digit of the number $ab$. The multiplication tables for numbers in a 2-arithmetic, 5-arithmetic and 6-arithmetic assume the form of Tables 5, 6 and 7.


TABLE 5

 × | 0 1
 0 | 0 0
 1 | 0 1

TABLE 6

 × | 0 1 2 3 4
 0 | 0 0 0 0 0
 1 | 0 1 2 3 4
 2 | 0 2 4 1 3
 3 | 0 3 1 4 2
 4 | 0 4 3 2 1

TABLE 7

 × | 0 1 2 3 4 5
 0 | 0 0 0 0 0 0
 1 | 0 1 2 3 4 5
 2 | 0 2 4 0 2 4
 3 | 0 3 0 3 0 3
 4 | 0 4 2 0 4 2
 5 | 0 5 4 3 2 1


A comparison of these tables enables us to make a salient distinction among them: whereas for 2-arithmetic and 5-arithmetic each row of the table, except the first one, which has all zeros, contains a one, it is not so for 6-arithmetic (here, the 1st, 3rd, 4th and 5th rows do not contain a one). Thus, in 2-arithmetic and 5-arithmetic, every number different from 0 has an inverse (in 2-arithmetic we have $1^{-1} = 1$; in 5-arithmetic we have the equalities $1^{-1} = 1$, $2^{-1} = 3$, $3^{-1} = 2$ and $4^{-1} = 4$); on the contrary, in a 6-arithmetic, the numbers 2, 3 and 4 have no inverse. Hence, it easily follows that 2-arithmetic and 5-arithmetic with respect to addition and multiplication are fields, but 6-arithmetic is not a field.

It is trivial to see that for every composite $q = kl$ (with $k > 1$, $l > 1$) a $q$-arithmetic cannot form a field: this stems, for example, from the fact that here $kl = 0$ (where multiplication is understood in the sense indicated above). If, however, $p$ is a prime number, then in a $p$-arithmetic every nonzero number has an inverse (see p. 376 below); hence a $p$-arithmetic with the operations of addition and multiplication of numbers defined in it is a finite field $F_p$ of $p$ elements (or a field of order $p$).

A description of all possible finite fields shall be given later in this
appendix. It is, however, convenient to consider now some general properties of
finite fields. We know that all elements of a field form an additive group and
all its nonzero elements form a multiplicative group. If $F_N$ is a finite
field of order N, then the corresponding additive group is of the same order N
and the multiplicative group is of order N - 1. Every element a of $F_N$ has an
additive order $n_a$ equal to the smallest positive integer satisfying the
relation $n_a a = 0$. If $a \neq 0$, then it has also a multiplicative order
$n_m$ equal to the smallest positive integer satisfying the relation
$a^{n_m} = 1$. According to the general results stated above, the integer $n_a$
is necessarily a divisor of N and the integer $n_m$ is a divisor of N - 1.

The additive order of the unit element 1 (i.e., the smallest positive integer n
satisfying the relation n1 = 0) is called the characteristic of $F_N$. It is
easy to show that the characteristic n is necessarily a prime. In fact, if k
and l are two arbitrary positive integers, then, evidently,

$$(k1) \times (l1) = \underbrace{(1 + 1 + \ldots + 1)}_{k \text{ terms}} \times \underbrace{(1 + 1 + \ldots + 1)}_{l \text{ terms}} = \underbrace{1 + 1 + \ldots + 1}_{kl \text{ terms}} = kl1.$$

If now n = kl, where $k \neq n$ and $l \neq n$ (so that k < n and l < n), then
$n1 = kl1 = (k1) \times (l1) = 0$, but $k1 \neq 0$ and $l1 \neq 0$ by virtue of
the definition of the characteristic. Since a product of two nonzero elements
of a field cannot be equal to zero, this is impossible, and n must be a prime.
Therefore, it is reasonable to denote hereafter the characteristic by the
letter p, which is commonly used to write a prime. An example of a field having
characteristic p is, of course, given by p-arithmetic.

We know that the order N of a field F must be a multiple of its characteristic
p. Later, it will be shown that N must have the form $p^k$, where k is a
positive integer. In the coding theory related to binary communication channels
the fields of characteristic 2 are most important. Every such field consists of
$2^k$ elements.

The multiplicative order n of a nonzero element a of a field $F_N$ must be a
divisor of N - 1. An element a whose multiplicative order is exactly N - 1 is
called a primitive element of $F_N$. If a is a primitive element of $F_N$, then
the elements $a, a^2, a^3, \ldots, a^{N-2}, a^{N-1} = 1$ coincide with all
nonzero elements of $F_N$. In other words, the multiplicative group of $F_N$ is
a cyclic group generated by the primitive element a.

Let us now prove the following important assertion: any finite field $F_N$
contains a primitive element a. Consider at first two elements $b_1$ and $b_2$
having relatively prime (multiplicative) orders $n_1$ and $n_2$. Then, it is
easy to show that the order of $b_1 b_2$ is equal to $n_1 n_2$. In fact, if
$(b_1 b_2)^k = 1$, then

$$b_1^k = b_2^{-k}, \qquad b_1^{n_2 k} = b_2^{-n_2 k} = (b_2^{n_2})^{-k} = 1,$$

and, similarly, $b_2^{n_1 k} = 1$. Therefore, $n_2 k$ must be a multiple of
$n_1$ and $n_1 k$ must be a multiple of $n_2$. Since $n_1$ and $n_2$ are
relatively prime numbers, k must be a multiple of both $n_1$ and $n_2$, and
hence of $n_1 n_2$. Moreover,

$$(b_1 b_2)^{n_1 n_2} = (b_1^{n_1})^{n_2} (b_2^{n_2})^{n_1} = 1 \times 1 = 1,$$

and hence $n_1 n_2$ is equal to the order of $b_1 b_2$.

Let us now consider an element a of $F_N$ having the highest order n and
suppose that n < N - 1. All elements of $F_N$ whose orders are divisors of n
satisfy the equation $x^n = 1$, i.e., $x^n - 1 = 0$. It is easy to show that an
equation of degree n with coefficients from an arbitrary field F cannot have
more than n distinct roots in the field F. (The proof of this statement is
completely analogous to the proof of the well-known special case of it related
to the field F of real numbers.) Since n < N - 1, not all nonzero elements of
$F_N$ can be roots of the equation $x^n - 1 = 0$. Hence, the field $F_N$
contains at least one element b whose order m is not a divisor of n. Then some
prime r enters the factorization of m to a higher power than that of n: we can
write $m = r^s m'$ and $n = r^t n'$, where $s > t \geq 0$ and neither $m'$ nor
$n'$ is divisible by r. Since b has order m, it is clear that $b^{m'}$ has
order $r^s$; similarly, $a^{r^t}$ has order $n'$. The orders $r^s$ and $n'$ are
relatively prime and, consequently, $a^{r^t} b^{m'}$ has order
$n' r^s = n r^{s-t} > n$. But this contradicts the assumption of n being the
highest order of all nonzero elements of $F_N$. Therefore, n = N - 1, i.e., a
is a primitive element of $F_N$.
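For a p-arithmetic with small p the existence of a primitive element can be
checked directly. The sketch below computes the multiplicative order of each
nonzero number by repeated multiplication and collects the numbers of order
p - 1. The names `multiplicative_order` and `primitive_elements` are our own,
and the search is a naive one, assuming that p is a small prime.

```python
def multiplicative_order(a, p):
    """Smallest n >= 1 with a**n = 1 in p-arithmetic (a must be nonzero mod p)."""
    if a % p == 0:
        raise ValueError("0 has no multiplicative order")
    n, power = 1, a % p
    while power != 1:
        power = (power * a) % p
        n += 1
    return n


def primitive_elements(p):
    """All numbers of p-arithmetic whose multiplicative order equals p - 1."""
    return [a for a in range(1, p) if multiplicative_order(a, p) == p - 1]
```

In 5-arithmetic, for instance, the successive powers of 2 are 2, 4, 3, 1, so
that 2 is a primitive element, while 4 has order 2 only.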

If a field $F_N$ has characteristic p, then, evidently, pa = a + a + ... + a (p
summands!) is equal to zero for every element a of $F_N$. In fact,
$a = a \times 1$ and

$$pa = a + a + \ldots + a = (a \times 1) + (a \times 1) + \ldots + (a \times 1) = a(1 + 1 + \ldots + 1) = a \times 0 = 0.$$

In particular, 2a = 0 for every element a of a field of characteristic 2.
Therefore, we have

$$(a + b)^2 = (a + b)(a + b) = a^2 + 2ab + b^2 = a^2 + b^2, \qquad (1)$$

$$(a + b + c)^2 = [(a + b) + c]^2 = (a + b)^2 + c^2 = a^2 + b^2 + c^2, \qquad (2)$$

and so on, so that

$$(a_1 + a_2 + \ldots + a_m)^2 = a_1^2 + a_2^2 + \ldots + a_m^2. \qquad (3)$$

Quite similarly it can be shown that in a field of characteristic p (where all
the intermediate binomial coefficients of the pth power are multiples of p)

$$(a_1 + a_2 + \ldots + a_m)^p = a_1^p + a_2^p + \ldots + a_m^p. \qquad (3a)$$
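Formula (3a) for two summands can be tested numerically in a p-arithmetic. The
following sketch checks the equality $(a + b)^p = a^p + b^p$ for all pairs a, b
and, separately, verifies that the binomial coefficients $\binom{p}{k}$ with
0 < k < p are multiples of p, which is the reason for the equality. The
function names are our own; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def power_of_sum_splits(q):
    """True if (a + b)**q == a**q + b**q for all a, b of q-arithmetic."""
    return all(
        pow(a + b, q, q) == (pow(a, q, q) + pow(b, q, q)) % q
        for a in range(q)
        for b in range(q)
    )


def middle_binomials_vanish(q):
    """True if every binomial coefficient C(q, k), 0 < k < q, is divisible by q."""
    return all(comb(q, k) % q == 0 for k in range(1, q))
```

For a composite modulus such as q = 4 both checks fail, in agreement with the
fact that a 4-arithmetic is not a field.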


We shall now consider a simple statement which is used in the construction of
certain error-correcting codes. Suppose that a and b are two distinct nonzero
elements of any (finite or infinite) field F. It is easy to show that the
equations

$$ax + by = 0, \qquad a^2 x + b^2 y = 0, \qquad (4)$$

where x and y are also elements of F, imply that x = 0 and y = 0. In fact, if
we multiply the first equation by b and then subtract from it the second
equation, we obtain

$$abx - a^2 x = 0, \quad \text{i.e.,} \quad a(b - a)x = 0,$$

and hence x = 0 (since $a \neq 0$ and $b - a \neq 0$). Then, of course, y = 0,
too. Moreover, if a, b and c are three distinct nonzero elements of F, then the
following three equations

$$ax + by + cz = 0, \qquad a^2 x + b^2 y + c^2 z = 0, \qquad a^3 x + b^3 y + c^3 z = 0 \qquad (4a)$$

imply that x = y = z = 0. In fact, if we multiply the first equation by c and
subtract from it the second equation, and also multiply the second equation by
c and subtract from it the third equation, then we obtain

$$a(c - a)x + b(c - b)y = 0, \qquad a^2 (c - a)x + b^2 (c - b)y = 0.$$

But these equations have the same form as equations (4), only x and y are now
replaced by (c - a)x and (c - b)y. Therefore, (c - a)x = (c - b)y = 0 and,
consequently, x = y = 0 and also z = 0.

The same arguments can be applied to the case of four similar equations, and so
on. By mathematical induction we may conclude that if $a_1, a_2, \ldots, a_m$
are m distinct nonzero elements of a field F and

$$\begin{aligned}
a_1 x_1 + a_2 x_2 + \ldots + a_m x_m &= 0, \\
a_1^2 x_1 + a_2^2 x_2 + \ldots + a_m^2 x_m &= 0, \\
\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \qquad (4b) \\
a_1^m x_1 + a_2^m x_2 + \ldots + a_m^m x_m &= 0,
\end{aligned}$$

then necessarily $x_1 = x_2 = \ldots = x_m = 0$. In particular, the relations

$$\begin{aligned}
a_1 + a_2 + \ldots + a_m &= 0, \\
a_1^2 + a_2^2 + \ldots + a_m^2 &= 0, \\
\ldots\ldots\ldots\ldots\ldots\ldots & \qquad (4c) \\
a_1^m + a_2^m + \ldots + a_m^m &= 0,
\end{aligned}$$

where $a_1, a_2, \ldots, a_m$ are distinct nonzero elements of a field F, cannot
be valid simultaneously.
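For a small p-arithmetic the statement about the system (4b) can be confirmed
by simply trying all possible values of the unknowns. The sketch below does
this; the function name `solutions` and the chosen examples are our own, and
exhaustive search is used purely for illustration (it is not how such systems
are solved in practice).

```python
from itertools import product


def solutions(coeffs, p):
    """All (x_1, ..., x_m) in p-arithmetic satisfying system (4b) for the given a_i."""
    m = len(coeffs)
    found = []
    for xs in product(range(p), repeat=m):
        # The j-th equation is a_1**j x_1 + ... + a_m**j x_m = 0, j = 1, ..., m.
        if all(
            sum(pow(a, j, p) * x for a, x in zip(coeffs, xs)) % p == 0
            for j in range(1, m + 1)
        ):
            found.append(xs)
    return found
```

With the distinct nonzero elements 1, 2, 3 of a 5-arithmetic only the zero
solution is found. If two of the coefficients coincide, say 2, 2, 3, nonzero
solutions such as (1, 4, 0) appear, which shows that the requirement of
distinctness cannot be dropped.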


We now recall the case of an arbitrary q-arithmetic, where q is in general a
composite number. In this case, we do not obtain a field since there are
elements of q-arithmetic that do not have an inverse. However, all the rest of
the conditions defining a field remain valid also for this case.


A (commutative) ring† is a set K of elements a, b, c, ..., for which the
operations of addition and multiplication are defined and which has the
following properties:


R1: the elements of K form a group with respect to addition;

R2: the multiplication of elements of K is such that ab = ba for all a and b;
(ab)c = a(bc) for all a, b and c; there is an element 1 such that
$a \times 1 = a$ for every a;

R3: addition and multiplication obey the distributive law


(a + b)c = ac + bc for all a, b and c.


†Here also we depart from the usual convention, according to which in defining
a ring we require addition to be commutative (the equality a + b = b + a being
satisfied for all a and b) but do not require multiplication to be commutative,
i.e., the equality ab = ba need not hold. (We also note that the requirement of
the existence of the unit element 1 is sometimes not included in the definition
of a ring.)

Examples


(a) It is plain that a field is a special case of a ring (a field is a ring with divi- 
sion); hence all examples of a field are simultaneously also examples of a ring. 

(b) A set of all integers forms a ring (with respect to the operations of ordin- 
ary addition and multiplication over numbers). 

(c) A collection of all polynomials with coefficients in some field F forms a
ring with respect to termwise addition and multiplication of polynomials: if

$$a(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_{n-1} x^{n-1}$$

and

$$b(x) = b_0 + b_1 x + b_2 x^2 + \ldots + b_{m-1} x^{m-1},$$

then

$$a(x) b(x) = a_0 b_0 + (a_0 b_1 + a_1 b_0) x + (a_0 b_2 + a_1 b_1 + a_2 b_0) x^2 + \ldots + a_{n-1} b_{m-1} x^{n+m-2}.$$


The null element of this ring is the polynomial 0, and the unit element is the 
polynomial 1 (both the polynomials are of zero degree). 


Examples (b) and (c) share, in fact, many common properties. One, for example,
is the existence in both rings of division with a remainder of the number a by
b or of the polynomial a(x) by b(x) (where $|a| \geq |b|$ and
$\deg a(x) \geq \deg b(x)$, respectively; by deg f(x) we denote the degree of
the polynomial f(x)). This division is represented, respectively, by the
equations a = ub + r, where |r| < |b|, and a(x) = u(x)b(x) + r(x), where
deg r(x) < deg b(x). Here the number u (resp. the polynomial u(x)) is the
quotient of the division of a by b (or a(x) by b(x)), and the number r (the
polynomial r(x)) is the remainder (the remainder of division may turn out to
be 0).

The procedure of division with a remainder can be used to find the greatest
common divisor (gcd) of two numbers or polynomials. Thus, for instance,
restricting ourselves to the case of positive integers a and b and denoting by
(a, b) the gcd of these numbers, we find consecutively:

$$\begin{aligned}
a &= ub + r, && \text{where } r < b \text{ and } (a, b) = (b, r); \\
b &= u_1 r + r_1, && \text{where } r_1 < r \text{ and } (b, r) = (r, r_1); \\
r &= u_2 r_1 + r_2, && \text{where } r_2 < r_1 \text{ and } (r, r_1) = (r_1, r_2); \\
&\ldots \\
r_{k-2} &= u_k r_{k-1} + r_k, && \text{where } r_k < r_{k-1} \text{ and } (r_{k-2}, r_{k-1}) = (r_{k-1}, r_k); \\
r_{k-1} &= u_{k+1} r_k && \text{and, hence, } (r_{k-1}, r_k) = r_k.
\end{aligned}$$

Thus, $r_k$ is also the number d = (a, b).


It is important to note that the number d = (a, b), determined by the indicated
method†, can be represented in terms of the original numbers a and b as

$$d = Ma + Nb, \qquad (*)$$

where M and N are some integers. In fact, from the equations set forth above we
successively obtain

$$\begin{aligned}
r &= 1 \times a + (-u) \times b && (= m \times a + n \times b), \\
r_1 &= 1 \times b + (-u_1) \times r && (= m_1 \times a + n_1 \times b), \\
r_2 &= 1 \times r + (-u_2) \times r_1 && (= m_2 \times a + n_2 \times b), \ldots, \\
r_k &= 1 \times r_{k-2} + (-u_k) \times r_{k-1} && (= M \times a + N \times b),
\end{aligned}$$

where all the numbers m and n (i.e., 1 and $-u$), $m_1$ and $n_1$ (they are
equal to $-u_1$ and $1 + u u_1$), $m_2$ and $n_2$, ..., M and N are integers.
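The successive substitutions described above are easily arranged as a
computation. The sketch below is one possible arrangement (the function name
`extended_gcd` and the variable names are our own): it carries along, for each
remainder, the pair of integers expressing that remainder in terms of the
original a and b, and is assumed to be applied to positive integers.

```python
def extended_gcd(a, b):
    """Return (d, M, N) with d = (a, b) and d = M*a + N*b, for positive integers a, b."""
    # Invariant: the current `a` equals m0*a_orig + n0*b_orig,
    # and the current `b` equals m1*a_orig + n1*b_orig.
    m0, n0, m1, n1 = 1, 0, 0, 1
    while b:
        u, r = divmod(a, b)  # a = u*b + r with 0 <= r < b
        a, b = b, r
        m0, m1 = m1, m0 - u * m1
        n0, n1 = n1, n0 - u * n1
    return a, m0, n0
```

For example, for a = 240 and b = 46 the function returns d = 2 together with
integers M and N for which 240M + 46N = 2.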

From the formula (*) it follows in particular that in p-arithmetic (where p is
a prime) every number $a \neq 0$ has an inverse. In fact, if 0 < a < p, then
obviously (a, p) = 1, and hence

$$1 = (a, p) = Ma + Np.$$

Thus, the product Ma $(= (-N) \times p + 1)$ when divided by p yields the
remainder 1. But this also implies that the number m of the p-arithmetic
corresponding to M (the remainder from the division of M by p) is the inverse
of a: in multiplying numbers by p-arithmetic rules we have ma = 1 and, hence,
$m = a^{-1}$. Exactly the same procedure enables us to find the gcd
(a(x), b(x)) of two polynomials a(x) and b(x) and show that if
(a(x), b(x)) = d(x), then

$$d(x) = M(x) \times a(x) + N(x) \times b(x), \qquad (**)$$

where M(x) and N(x) are some polynomials.
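Formula (**) can be illustrated for polynomials with coefficients from a
2-arithmetic. In the sketch below a polynomial is stored as a nonnegative
integer whose binary digits are its coefficients (bit i is the coefficient of
$x^i$); this representation, as well as the names `deg`, `pmul`, `pdivmod` and
`pxgcd`, is our own assumption and is not taken from the text. Addition of such
polynomials is the bitwise exclusive-or, since 1 + 1 = 0.

```python
def deg(p):
    """Degree of a polynomial over 2-arithmetic (deg 0 is taken to be -1 here)."""
    return p.bit_length() - 1


def pmul(a, b):
    """Ordinary product of two polynomials over 2-arithmetic."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def pdivmod(a, b):
    """Quotient and remainder of the division of a(x) by a nonzero b(x)."""
    quotient = 0
    while a and deg(a) >= deg(b):
        shift = deg(a) - deg(b)
        quotient ^= 1 << shift
        a ^= b << shift
    return quotient, a


def pxgcd(a, b):
    """Return (d, M, N) with d(x) = (a(x), b(x)) = M(x) a(x) + N(x) b(x)."""
    m0, n0, m1, n1 = 1, 0, 0, 1
    while b:
        u, r = pdivmod(a, b)
        a, b = b, r
        m0, m1 = m1, m0 ^ pmul(u, m1)
        n0, n1 = n1, n0 ^ pmul(u, n1)
    return a, m0, n0
```

Thus `pxgcd(0b110, 0b101)` finds that the gcd of $x^2 + x$ and $x^2 + 1$ is
x + 1 (the integer `0b11`), together with polynomials M(x) and N(x) satisfying
(**).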

The analogy between a ring of integers and a ring of polynomials (with
coefficients from any field F) can be characterized differently also. A subset
J of elements of an arbitrary ring K is called an ideal of this ring, if

(i) J is a subgroup with respect to the addition operation in K;
(ii) for each a in J all products ak, where k is an arbitrary element of K,
also belong to J.


†This procedure of determining the gcd of a and b is called the Euclidean
division algorithm; a ring in which this procedure is valid (in particular, a
ring of integers or a ring of polynomials) is sometimes called a Euclidean
ring.


A typical example of the ideal of a ring of integers is a set of all numbers 
divisible by an arbitrarily chosen integer i (i.e. of all numbers of the form ai, 
where a runs through all integral values). In analogy to this, an example of an 
ideal in a set of polynomials is a set of polynomials that are divisible by an 
arbitrary preassigned polynomial i(x) (i.e., a set of polynomials of the form 
a(x)i (x), where a(x) is an arbitrary polynomial). An ideal so constructed is 
called a principal ideal of a ring of integers (resp. polynomials) generated by a 
number i (resp. a polynomial i(x)). 

We make the following statement, which reveals a deeper common characteristic
of rings of integers and of polynomials.

In a ring of integers or in a ring of polynomials every ideal J is a principal
ideal, i.e., it consists of all possible multiples of a fixed integer i
(corresp. polynomial i(x)).

The proof of this statement presents no difficulty. Indeed, it is possible that
an ideal of a ring of integers consists of just the single number 0 (for this
one-element set all the conditions defining an ideal are obviously satisfied),
but in such a case this is a principal ideal generated by the number 0. If,
however, this is not so, then we denote by i the least number in absolute
magnitude, different from zero, that belongs to the ideal J (for simplicity we
may agree to consider, say, that i > 0). It is now required to show that every
other nonzero number b belonging to J is necessarily a multiple of i. Since
$|b| \geq i$, b can be divided by i with a remainder:

$$b = ai + r, \quad \text{where } 0 \leq r < i.$$

However, since J is an ideal, together with b and i, the numbers ai, -ai and
r = b + (-ai) also belong to it. Hence r = 0 (because i is the least number in
absolute magnitude, different from zero, belonging to J) and, hence, b = ai.

For the case of a ring of polynomials the statement is proved in exactly the
same way; here it is necessary only to take i(x) as a nonzero polynomial of
least degree, belonging to the ideal J.
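The statement just proved can be observed on a numerical example. The smallest
ideal of the ring of integers containing two given numbers consists of all
their combinations with integral coefficients; the sketch below samples such
combinations with bounded coefficients and picks out the least positive one.
The function names `sample_ideal` and `least_positive` and the coefficient
bound are our own choices, and a finite sample can of course only illustrate,
not prove, the statement.

```python
from itertools import product
from math import gcd


def sample_ideal(generators, bound):
    """Integer combinations of the generators with coefficients in [-bound, bound]."""
    coefficients = range(-bound, bound + 1)
    return {
        sum(c * g for c, g in zip(cs, generators))
        for cs in product(coefficients, repeat=len(generators))
    }


def least_positive(numbers):
    """Least positive member of a set of integers."""
    return min(x for x in numbers if x > 0)
```

For the generators 12 and 18 the least positive member of the sample is 6,
which is the gcd of 12 and 18, and every sampled number is a multiple of 6, as
the statement requires.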


We now pass on to further examples of a ring. 

(d) We have already seen that a q-arithmetic with addition and multiplication
defined for it is a ring of q elements (a finite ring or a ring of finite order
q). Moreover, if q is prime, then our ring is a field.

(e) We noted above that a collection of polynomials of degree < n, where n is a
fixed number, is a group with respect to addition (a finite group, if the
coefficients of the polynomials are elements of a finite field). However, such
polynomials do not form a ring because the degree of the product of two
polynomials is in general higher than the degree of either of the factors. In
order to transform a collection of polynomials of degree < n into a ring, we
may proceed as follows.

We choose a fixed (any one convenient to us) nth degree polynomial Q(x) and
agree to replace every polynomial by the remainder of its division by Q(x); the
degree of this remainder is then obviously < n. Thus we arrive at a
‘Q(x)-arithmetic’ of polynomials, which contains no polynomial of degree
$\geq n$. In particular, the ‘product’ of two polynomials, understood in the
sense of ‘Q(x)-arithmetic’, always has degree < n. A Q(x)-arithmetic is always
(i.e., for any choice of the polynomial Q(x)) a ring; it is a finite ring, if
the field of coefficients of polynomials is finite. If the field F of
coefficients is of order p and deg Q(x) = n, then the ring under consideration
is of order $p^n$.

We give below multiplication Tables 8 and 9 for the four polynomials of degree
< 2 with coefficients from a 2-arithmetic in $(x^2 + x)$-arithmetic and
$(x^2 + x + 1)$-arithmetic.


TABLE 8

  ×   |  0     1     x     x+1
------+------------------------
  0   |  0     0     0     0
  1   |  0     1     x     x+1
  x   |  0     x     x     0
  x+1 |  0     x+1   0     x+1

TABLE 9

  ×   |  0     1     x     x+1
------+------------------------
  0   |  0     0     0     0
  1   |  0     1     x     x+1
  x   |  0     x     x+1   1
  x+1 |  0     x+1   1     x


A comparison of these two tables is instructive. The last two rows of Table 8
do not contain the number 1, implying that in $(x^2 + x)$-arithmetic the
polynomials x and x + 1 have no inverse. In contrast, in Table 9 all rows
contain 1, except the single first row which consists only of zeros. This means
that in $(x^2 + x + 1)$-arithmetic all polynomials different from zero have an
inverse:

$$1^{-1} = 1, \quad x^{-1} = x + 1 \quad \text{and} \quad (x + 1)^{-1} = x.$$

Thus, whereas an $(x^2 + x)$-arithmetic of polynomials with coefficients from a
2-arithmetic is only a ring, an $(x^2 + x + 1)$-arithmetic of polynomials with
coefficients from the same field forms a field. It is not difficult to
comprehend the reason for this distinction. The polynomial $Q(x) = x^2 + x$ is
composite; it splits into two factors of degree one:

$$x^2 + x = x(x + 1).$$

This implies that $(x^2 + x)$-arithmetic cannot form a field (this is, for
example, implied by the fact that here x(x + 1) = 0). On the contrary, the
polynomial $P(x) = x^2 + x + 1$ is prime (or, as we often say in algebra, is
irreducible); it cannot be split into factors of degree $\geq 1$. This, in
turn, directly implies that in a P(x)-arithmetic every polynomial
$a(x) \neq 0$ has an inverse; the proof of this fact is based on formula (**)
on p. 376 and is quite similar to the proof of the fact that in a p-arithmetic,
where p is prime, every number $a \neq 0$ has an inverse.
Thus we arrive at one more example of a field.
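Tables 8 and 9 can be reproduced by a short computation. In the sketch below a
polynomial with coefficients from a 2-arithmetic is stored as an integer whose
binary digits are its coefficients, so that 0, 1, x and x + 1 correspond to the
integers 0, 1, 2 and 3, while $x^2 + x$ and $x^2 + x + 1$ correspond to `0b110`
and `0b111`. This encoding and the names `poly_mod`, `mul_mod` and
`multiplication_table` are our own assumptions.

```python
def poly_mod(a, q):
    """Remainder of the division of a(x) by q(x) over 2-arithmetic."""
    dq = q.bit_length() - 1
    while a.bit_length() - 1 >= dq:
        a ^= q << (a.bit_length() - 1 - dq)
    return a


def mul_mod(a, b, q):
    """Product of a(x) and b(x) in the sense of q(x)-arithmetic."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return poly_mod(product, q)


def multiplication_table(q):
    """Multiplication table of q(x)-arithmetic; rows and columns run over 0 .. 2**deg(q) - 1."""
    size = 1 << (q.bit_length() - 1)
    return [[mul_mod(a, b, q) for b in range(size)] for a in range(size)]
```

The row of `multiplication_table(0b110)` corresponding to x is 0, x, x, 0 (the
integers 0, 2, 2, 0), as in Table 8, whereas in `multiplication_table(0b111)`
the same row is 0, x, x + 1, 1 (the integers 0, 2, 3, 1), as in Table 9.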


C. If P(x) is an irreducible polynomial with coefficients from a certain field
F, then the P(x)-arithmetic with coefficients in F forms a field. If F is the
finite field $F_p$ of order p described above (where p is an arbitrary prime
number) and deg P(x) = n, then the field obtained is of order $p^n$.

P(x)-arithmetic is not only an important example of a field; it also admits a
different interpretation. Let us recall the formation of the field of complex
numbers $F_c$ by an extension of the field of real numbers $F_r$. It is a basic
fact that the equation $x^2 + 1 = 0$ cannot be solved within the field $F_r$.
To make this equation solvable we extend the field $F_r$ by adding a new
element i that denotes a (non-existing) root of the considered equation. In
other words, we agree that $i^2 + 1 = 0$. The field $F_c$ that contains all
real numbers and also the element i must obviously contain all binomials
a + bi, a and b being arbitrary real numbers, and also all polynomials in i
with real coefficients. However, the higher powers of i can be easily
eliminated; we know that $i^2 + 1 = 0$ and, therefore, an arbitrary polynomial
T(i) in i with real coefficients can be replaced by the remainder of the
division of T(i) by $i^2 + 1$. In fact, if $T(i) = t(i)(i^2 + 1) + r(i)$, then
T(i) = r(i) within the field $F_c$. But r(i) is also a binomial of the form
a + bi. Therefore, our field $F_c$ consists of all binomials a + bi with the
usual addition and multiplication supplemented by the rule that every
polynomial in i must be replaced by its remainder after division by $i^2 + 1$
(i.e., $i^2$ must be replaced by -1). This construction leads to the customary
field of complex numbers which is, of course, equivalent to
$(x^2 + 1)$-arithmetic.
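The equivalence of complex multiplication and multiplication in
$(x^2 + 1)$-arithmetic can be seen on a small computation. The sketch below
multiplies two binomials a + bx and c + dx as ordinary polynomials and then
replaces $x^2$ by -1; the function name `multiply_mod_x2_plus_1` and the
representation of a binomial by the pair of its coefficients are our own.

```python
def multiply_mod_x2_plus_1(u, v):
    """Product of binomials u = (a, b) ~ a + b*x and v = (c, d) ~ c + d*x in (x**2 + 1)-arithmetic."""
    # Ordinary product of the two polynomials: coefficients of 1, x, x**2.
    product = [0] * (len(u) + len(v) - 1)
    for i, a in enumerate(u):
        for j, c in enumerate(v):
            product[i + j] += a * c
    # Division by x**2 + 1: each term k*x**n with n >= 2 is replaced by -k*x**(n - 2).
    for n in range(len(product) - 1, 1, -1):
        product[n - 2] -= product[n]
        product[n] = 0
    return product[0], product[1]
```

For instance, the binomials 1 + 2x and 3 + 4x give the pair (-5, 10), in
agreement with the product (1 + 2i)(3 + 4i) = -5 + 10i of complex numbers.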

A quite similar procedure can be applied to obtain a new interpretation of an
arbitrary P(x)-arithmetic. Suppose that P(x) is an irreducible nth degree
polynomial (n > 1) with coefficients in a field F. Then, the equation P(x) = 0
is, of course, unsolvable within F (otherwise P(x) would be reducible). Let us
extend F such that this equation becomes solvable within the extended field
$F^*$. For this, we must add to the field F a symbol $\alpha$ which is a root
of the considered equation (so that by definition $P(\alpha) = 0$). Since $F^*$
is a field, it must include also all polynomials in $\alpha$ with coefficients
in F. However, in the case in which the degree of the polynomial $T(\alpha)$ is
not less than n this polynomial can be replaced within $F^*$ by the remainder
after division of $T(\alpha)$ by $P(\alpha)$. Therefore, we can consider that
the field $F^*$ consists of all polynomials of the form
$a_0 + a_1 \alpha + \ldots + a_{n-1} \alpha^{n-1}$, where
$a_0, a_1, \ldots, a_{n-1}$ are elements of F, with the usual addition and
multiplication supplemented by the replacement of the obtained product by the
remainder of its division by $P(\alpha)$. It is clear that the field $F^*$ is
equivalent to P(x)-arithmetic.

It can be shown that for every prime p and for each k > 1 there exist
irreducible kth degree polynomials with coefficients from the field $F_p$;
hence, it follows that for every integer k > 1 and every prime p there exists a
finite field of order $p^k$ (a field of order $p^1 = p$ is formed by a
p-arithmetic itself). Moreover, although there may exist many irreducible
polynomials P(x) of a given degree k with coefficients from the field $F_p$,
all P(x)-arithmetics corresponding to them are constructed alike: for every
prime p and each $k \geq 1$ there exists just a single (to within the
rearrangement of elements) field of order $p^k$. If, however, the integer m
does not have the form $p^k$ (i.e., if m contains at least two distinct prime
factors), then a field of order m does not exist at all.†

Before we conclude we note further that since the Q(x)-arithmetic is obtained
from a ring of all polynomials (with coefficients in some chosen field F) by
‘coalescing’ all polynomials that yield one and the same remainder when divided
by Q(x), the ideals of a Q(x)-arithmetic are also obtained from the ideals of a
ring of all polynomials by similarly identifying all those polynomials of an
ideal that yield the same remainder on division by Q(x). This, in turn, implies
that the ideals of a Q(x)-arithmetic are constructed in the same manner as are
the ideals of a ring of all polynomials: here also every ideal is a principal
ideal (i.e., it consists of all polynomials that are multiples, in the sense of
the Q(x)-arithmetic, of some fixed polynomial i(x)). However, in this
connection it is necessary to keep in mind, as is easy to perceive from formula
(**) on p. 376, that the set of all those polynomials that are multiples (in
the sense of a Q(x)-arithmetic) of a given polynomial i(x) coincides with the
set of all polynomials that are multiples of the polynomial d(x), where
d(x) = (Q(x), i(x)) is the gcd of the polynomials i(x) and Q(x). Hence, it
follows that, for an irreducible (prime) polynomial Q(x), a Q(x)-arithmetic
contains no ideal other than 0 and the entire ring (the entire Q(x)-arithmetic
itself); in fact, here the gcd of Q(x) and i(x) is either 1 or Q(x). If,
however, a polynomial Q(x) is reducible, i.e., if it splits into factors whose
degree is less than deg Q(x), then the set of all polynomials that are
multiples of any one factor of Q(x) forms an ideal of the Q(x)-arithmetic.
Thus, by way of example, in the case of an $(x^2 + x)$-arithmetic over a
2-arithmetic the set of all ideals consists of the ‘zero ideal’ {0}; the entire
$(x^2 + x)$-arithmetic; the set {x, 0} of polynomials that are multiples of x;
and the set {x + 1, 0} of polynomials that are multiples of x + 1 (see Table 8
on p. 378).


†Thus, a field of finite order m exists if $m = p^k$, where p is some prime
number, but does not exist for any other number m. Moreover, there is only a
single field of order $p^k$ for every prime p and positive integer k. All these
fields are due to Evariste Galois (1811-1832), the noted French mathematician,
and are hence called Galois fields.
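The list of ideals of the $(x^2 + x)$-arithmetic can be confirmed by examining
all subsets of its four elements. In the sketch below polynomials over a
2-arithmetic are encoded as integers (0, 1, x, x + 1 as 0, 1, 2, 3 and
$x^2 + x$, $x^2 + x + 1$ as `0b110`, `0b111`); the encoding, the names
`mul_mod` and `ideals`, and the use of exhaustive search are our own
assumptions, suitable only for very small rings.

```python
from itertools import combinations


def mul_mod(a, b, q):
    """Product of a(x) and b(x) over 2-arithmetic, reduced by q(x)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    dq = q.bit_length() - 1
    while product.bit_length() - 1 >= dq:
        product ^= q << (product.bit_length() - 1 - dq)
    return product


def ideals(q):
    """All ideals of q(x)-arithmetic over 2-arithmetic, found by exhaustive search."""
    elements = range(1 << (q.bit_length() - 1))
    found = []
    for size in range(1, len(elements) + 1):
        for subset in combinations(elements, size):
            s = set(subset)
            # Over 2-arithmetic every element is its own negative, so a subset
            # containing 0 and closed under addition (XOR) is a subgroup.
            is_subgroup = 0 in s and all((a ^ b) in s for a in s for b in s)
            absorbs = all(mul_mod(a, k, q) in s for a in s for k in elements)
            if is_subgroup and absorbs:
                found.append(s)
    return found
```

For `0b110` the search yields exactly the four ideals named above, whereas for
the irreducible polynomial `0b111` only the zero ideal and the entire ring are
found.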


3. We shall now enunciate one more algebraic concept that is found to be 
useful in coding theory. 

A set V of elements $\mathbf{a}, \mathbf{b}, \mathbf{c}, \ldots$ (called
vectors) forms a vector space over a field F (the elements of the field are
called numbers; the null and unit elements of the field are denoted below by
the symbols 0 and 1), if


(i) for the set of vectors the operation of addition is defined such that the
vectors form a group (the null element of this group is denoted by
$\mathbf{0}$);
(ii) the operation of multiplication of a vector by a number is defined;
moreover, the product $a\mathbf{a}$ (where a is a number and $\mathbf{a}$ a
vector) is a vector; and


VS1: the multiplication of a vector by a number is associative:
$a(b\mathbf{a}) = (ab)\mathbf{a}$ for all numbers a, b and every vector
$\mathbf{a}$;


VS2: the multiplication of a vector by a number is distributive relative to the
addition of numbers:


$(a + b)\mathbf{a} = a\mathbf{a} + b\mathbf{a}$ for all numbers a, b and every
vector $\mathbf{a}$;


VS3: the multiplication of a vector by a number is distributive relative to the
addition of vectors:


$a(\mathbf{a} + \mathbf{b}) = a\mathbf{a} + a\mathbf{b}$ for every number a and
all vectors $\mathbf{a}$, $\mathbf{b}$;

VS4: $1\mathbf{a} = \mathbf{a}$ for every vector $\mathbf{a}$.


From the properties (axioms) of multiplication of a vector by a number it is
easy to deduce also that


$0\mathbf{a} = \mathbf{0}$ for every vector $\mathbf{a}$;
$a\mathbf{0} = \mathbf{0}$ for every number a; $(-1)\mathbf{a} = -\mathbf{a}$
for every vector $\mathbf{a}$.


Examples 
A. Blocks (or vectors) $\mathbf{a} = (a_0, a_1, \ldots, a_{N-1})$, where N is a
fixed natural number and $a_0, a_1, \ldots, a_{N-1}$ are arbitrary numbers in a
field F, form a vector space with respect to the operations of addition of
vectors and the multiplication of a vector by a number, which are defined as
follows: if

$$\mathbf{a} = (a_0, a_1, \ldots, a_{N-1}) \quad \text{and} \quad \mathbf{b} = (b_0, b_1, \ldots, b_{N-1}),$$

then

$$\mathbf{a} + \mathbf{b} = (a_0 + b_0, a_1 + b_1, \ldots, a_{N-1} + b_{N-1});$$

if $\mathbf{a} = (a_0, a_1, \ldots, a_{N-1})$, then
$a\mathbf{a} = (aa_0, aa_1, \ldots, aa_{N-1})$. Here F is called a field of
scalars or a basic field, over which the vector space V is constructed; the
numbers $a_0, a_1, \ldots, a_{N-1}$ are called the coordinates of the vector
$\mathbf{a}$ and the number N the dimension of V.

If F is an infinite field, then the number of all possible vectors is also
infinite; if F is, however, of order m, then the vector space V of dimension N
(N-dimensional vector space) contains only $m^N$ distinct vectors.

This example is central; the other examples always reduce to it. 


B. The vectors (directed segments) of a plane or of the usual
(three-dimensional) space form a vector space with respect to the operations of
addition of vectors and the multiplication of a vector by a (real) number,
defined as follows:

$$\overrightarrow{OA} + \overrightarrow{OB} = \overrightarrow{OC}$$

if OC is a diagonal of the parallelogram OACB, constructed on the segments OA
and OB;

$$\overrightarrow{OD} = a \times \overrightarrow{OA}$$

if the segments OD and OA belong to the same straight line, their lengths
satisfy

$$OD = |a| \times OA,$$

and D and A lie on the same side with respect to O if a > 0, but on opposite
sides if a < 0.

Example B reduces to the central Example A if we introduce in the usual manner
the coordinates (x, y) of the vector $\overrightarrow{OA}$ of a plane (Fig.
38a) and the coordinates (x, y, z) of the vector $\overrightarrow{OA}$ of a
space (Fig. 38b). It is also found that in the case of vectors of a plane, if
$\mathbf{a} = (x, y)$ and $\mathbf{b} = (x_1, y_1)$, then

$$\mathbf{a} + \mathbf{b} = (x + x_1, y + y_1) \quad \text{and} \quad a\mathbf{a} = (ax, ay);$$

in the case of vectors of a space, if $\mathbf{a} = (x, y, z)$ and
$\mathbf{b} = (x_1, y_1, z_1)$, then

$$\mathbf{a} + \mathbf{b} = (x + x_1, y + y_1, z + z_1) \quad \text{and} \quad a\mathbf{a} = (ax, ay, az).$$

Thus, the vectors of a plane form a two-dimensional vector space, and the
vectors of a space form a three-dimensional vector space over the field of real
numbers.

Fig. 38 (a), (b).

C. It is obvious that a set of all (m × n)-matrices with elements in a field F
forms an (mn)-dimensional vector space over F if the addition of matrices is
defined as above, and the multiplication of a matrix by a number a is defined
as the multiplication of all elements of the matrix by this number. This
example differs from the basic Example A only in that here the mn coordinates
of a vector are written not in a single row, but in the form of a rectangular
matrix.


D. All polynomials of degree less than n

$$a_0 + a_1 x + a_2 x^2 + \ldots + a_{n-1} x^{n-1}$$

with coefficients in a field F form an n-dimensional vector space over F. In
fact, every polynomial can be characterized by its coefficients
$a_0, a_1, \ldots, a_{n-1}$ (which, if considered more convenient, can be set
out within parentheses), and the (ordinary) addition of polynomials and the
multiplication of a polynomial by a number reduce, respectively, to addition of
the coefficients of two polynomials and multiplication of the coefficients of
the polynomial by a number.

Since P(x)-arithmetic, where P(x) has degree n, consists of all polynomials of
degree less than n, it is clear that P(x)-arithmetic also forms an
n-dimensional vector space over the coefficient field F. Note that the field F
is, in fact, the collection of all polynomials of degree zero (i.e., constants)
and hence F is a subfield of P(x)-arithmetic. It is also possible to prove
that, if F is an arbitrary field and $F_0$ is its subfield (i.e., a set of
elements of F forming a field with respect to the operations defined in F),
then F necessarily forms a vector space over the field $F_0$. A finite field
having characteristic p evidently includes a subfield $F_0$ consisting of the p
elements 1, 1 + 1 = 2, 2 + 1 = 3, ..., (p - 1) + 1 = 0. Therefore, the field F
can be represented as a vector space over $F_0$ and, of course, this vector
space must be finite-dimensional. If its dimension is k, then it contains
exactly $p^k$ vectors; this implies the result stated above that a finite field
of characteristic p must have $p^k$ distinct elements.


E. All possible polynomials

$$a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n$$

(their degrees are now unrestricted) also form a vector space with respect to
the operations of ordinary addition of polynomials and the multiplication of a
polynomial by a number. This example, however, does not coincide with Example
A, because the number of coefficients of a polynomial can be arbitrarily large;
hence we say that a space of all polynomials does not have a dimension
(sometimes we say differently that it has an infinite dimension).

We now suppose that W is some portion of the vectors of a vector space V. If
this set W satisfies the following conditions:


SS1: if the vectors $\mathbf{a}$, $\mathbf{b}$ belong to W, then
$\mathbf{a} + \mathbf{b}$ also belongs to W;


SS2: if $\mathbf{a}$ belongs to W, then all vectors $a\mathbf{a}$ also belong
to W, where a runs through all possible numbers,


then W is itself a vector space with respect to the operations (defined in V)
of addition of the vectors and the multiplication of a vector by a number. In
this case we say that W is a (linear or vector) subspace of a vector space V.

In particular, if V is a set of vectors $\overrightarrow{OA}$ of an ordinary
space, and W is a plane that passes through the point O (Fig. 39), then the
vectors $\overrightarrow{OB}$ belonging to W form a subspace of the
three-dimensional vector space.

Fig. 39.

If V is a set of all n-dimensional vectors

$$\mathbf{a} = (a_1, a_2, \ldots, a_n),$$

then a subspace is formed by a set W of vectors $\mathbf{a}$, whose coordinates
satisfy a fixed equation of the form

$$b_1 a_1 + b_2 a_2 + \ldots + b_n a_n = 0, \qquad (5)$$


where $b_1, b_2, \ldots, b_n$ are arbitrary fixed ‘numbers’, i.e., are elements
of the field to which the vector coordinates belong. In fact, it is easy to see
that if the numbers $a_1, a_2, \ldots, a_n$ and $a'_1, a'_2, \ldots, a'_n$
satisfy the relation (5), then the numbers
$a_1 + a'_1, a_2 + a'_2, \ldots, a_n + a'_n$ do so also; similarly, if the
numbers $a_1, a_2, \ldots, a_n$ satisfy (5), then relation (5) holds also for
the numbers $aa_1, aa_2, \ldots, aa_n$, where a is an arbitrary number. It is
also quite simple to demonstrate that a set W of vectors $\mathbf{a}$, whose
coordinates satisfy the system of equations

$$\left.\begin{aligned}
b_{11} a_1 + b_{12} a_2 + \ldots + b_{1n} a_n &= 0, \\
b_{21} a_1 + b_{22} a_2 + \ldots + b_{2n} a_n &= 0, \\
\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \\
b_{m1} a_1 + b_{m2} a_2 + \ldots + b_{mn} a_n &= 0,
\end{aligned}\right\} \qquad (6)$$

forms a vector space. It is also shown in the textbooks on linear algebra that 
every linear subspace W of an n-dimensional vector sapce V can be defined by 
a system of the form (6) (possibly, by relation (5) alone), connecting the co- 
ordinates of vectors belonging to this subspace. In particular, the vectors of a 
three-dimensional space V, belonging to a fixed plane W, may be determined 
by the condition that their coordinates x, y, z satisfy the relation 


$$b_1 x + b_2 y + b_3 z = 0,$$


where $(b_1, b_2, b_3)$ are the coordinates of an arbitrary vector b perpendicular to 
the plane W (Fig. 39). 
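
The statement about systems of the form (6) can be checked numerically on a small case. The following is only a sketch under stated assumptions: the matrix `B`, the helper names `solutions_mod_p` and `is_subspace`, and the brute-force enumeration are illustrative choices that do not come from the text, and the coordinates are taken in a p-arithmetic with p prime.

```python
from itertools import product

def solutions_mod_p(B, p):
    """All vectors a with coordinates in {0, ..., p-1} such that B a = 0 (mod p)."""
    n = len(B[0])
    return [a for a in product(range(p), repeat=n)
            if all(sum(b * x for b, x in zip(row, a)) % p == 0 for row in B)]

def is_subspace(W, p):
    """Check SS1 (closure under addition) and SS2 (closure under multiplication by numbers)."""
    S = set(W)
    ss1 = all(tuple((x + y) % p for x, y in zip(a, b)) in S for a in S for b in S)
    ss2 = all(tuple((c * x) % p for x in a) in S for a in S for c in range(p))
    return ss1 and ss2

# Hypothetical coefficient matrix of a system of the form (6) with m = 2, n = 4.
B = [[1, 1, 0, 1],
     [0, 1, 1, 1]]

W2 = solutions_mod_p(B, 2)   # solutions in 2-arithmetic
W3 = solutions_mod_p(B, 3)   # solutions in 3-arithmetic
```

Both `is_subspace(W2, 2)` and `is_subspace(W3, 3)` hold for this matrix, in agreement with the general statement.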

Let us present some more examples of vector subspaces. 

A set of all polynomials of degree < n forms a linear subspace of the space of 
all possible polynomials. If k < n, then the set of all polynomials of degree < k 
forms a subspace of the space of polynomials of degree < n. A set of all polynomials 
of the form 


a(x) = g(x) b(x), 


where g(x) is a fixed and b(x) is an arbitrary polynomial, makes up a subspace 
of a space of all polynomials (and if g(x) is of degree k but the polynomials 
b(x) are of degree < n — k, then this set is a subspace of a space of all poly- 
nomials of degree < n). 

We further note that, in the case of a vector space over the field of the numbers 0 
and 1 (over a 2-arithmetic), a verification of the fact that some set of vectors 
forms a linear subspace of the original space reduces to checking the property 
SS1 (because here we have no numbers other than 0 and 1, and the vector 
0 × a can always be represented in the form of the sum a + a). Thus, here every 
subspace of a vector space coincides with a subgroup of the group of vectors with 
respect to addition. It is not difficult to show that exactly the same situation 
obtains in the case of a vector space constructed over every p-arithmetic, where 
p is prime; however, in the case of a base field different from a p-arithmetic (say, 
when a P(x)-arithmetic figures as the base field, where P(x) is an irreducible 
polynomial), there exist subgroups of a vector space that are not its 
subspaces. 

The notion of linear dependence of vectors is related to the notion of linear 
subspace of a vector space. The vectors $a_0, a_1, \ldots, a_n$ of a vector space V are 
said to be linearly dependent, if there are numbers (i.e., elements of the 
basic field F) $\lambda_0, \lambda_1, \ldots, \lambda_n$ such that not all of them are equal to zero and 


$$\lambda_0 a_0 + \lambda_1 a_1 + \cdots + \lambda_n a_n = 0. \qquad (7)$$


Conversely, if numbers $\lambda_0, \lambda_1, \ldots, \lambda_n$ satisfying the above conditions do 
not exist, the vectors $a_0, a_1, \ldots, a_n$ are said to be linearly independent. It is 
easy to show that, if $a_1, \ldots, a_n$ are n linearly independent vectors of a vector 
space V, then the set of all the vectors $a_0$ such that the n + 1 vectors $a_0, a_1, \ldots, a_n$ 
are linearly dependent forms an n-dimensional linear subspace of V. In fact, 
it is clear that the coefficient $\lambda_0$ of equation (7) is here necessarily different from 
zero for all vectors $a_0$ (since, otherwise, the vectors $a_1, \ldots, a_n$ would be 
linearly dependent). Now, if we multiply equation (7) by $\lambda_0^{-1}$ and write 


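$$\mu_i = -\lambda_i \lambda_0^{-1} = -\frac{\lambda_i}{\lambda_0} \quad (i = 1, 2, \ldots, n),$$

we obtain 

$$a_0 = \mu_1 a_1 + \cdots + \mu_n a_n. \qquad (8)$$
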
Therefore, the set of vectors $a_0$ satisfying the condition that $a_0, a_1, \ldots, a_n$ are 
linearly dependent vectors coincides with the set of vectors of the form (8), where 
the coefficients $\mu_1, \ldots, \mu_n$ run through the field F. It is clear that the last 
vector set satisfies conditions SS1 and SS2 which define a linear subspace. Moreover, 
a ‘block’ of n numbers $\mu = (\mu_1, \mu_2, \ldots, \mu_n)$ can be associated with every 
vector $a_0$, while two different vectors $a_0^{(1)}$ and $a_0^{(2)}$ correspond to any pair of 
distinct blocks $\mu^{(1)}$ and $\mu^{(2)}$. (In fact, if 


$$a_0 = \mu_1^{(1)} a_1 + \cdots + \mu_n^{(1)} a_n = \mu_1^{(2)} a_1 + \cdots + \mu_n^{(2)} a_n,$$


where not all the differences $\mu_i^{(1)} - \mu_i^{(2)}$ are equal to zero, then 
$(\mu_1^{(1)} - \mu_1^{(2)}) a_1 + \cdots + (\mu_n^{(1)} - \mu_n^{(2)}) a_n = 0$, i.e., the vectors 
$a_1, \ldots, a_n$ are linearly dependent, contradicting 
our assumption.) This shows that the linear subspace of all vectors $a_0$ is really 
n-dimensional. (In particular, if F is a finite field of order q, then the number of 
all blocks $\mu$ and, therefore, also the number of vectors $a_0$ is equal to $q^n$. This 
again shows that the considered subspace is of dimension n.) 
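
The count $q^n$ can be illustrated by direct enumeration. The sketch below assumes a p-arithmetic with q = 3; the function name `span_mod_q` and the sample vectors are hypothetical and serve only as an illustration of formula (8).

```python
from itertools import product

def span_mod_q(vectors, q):
    """Set of all vectors mu_1 a_1 + ... + mu_n a_n with every mu_i in {0, ..., q-1}.

    Arithmetic is taken modulo q, so q is assumed to be prime (a q-arithmetic).
    """
    dim = len(vectors[0])
    return {
        tuple(sum(m * v[k] for m, v in zip(mu, vectors)) % q for k in range(dim))
        for mu in product(range(q), repeat=len(vectors))
    }

# Two linearly independent vectors over the 3-arithmetic: distinct blocks
# (mu_1, mu_2) give distinct vectors, so the span has 3**2 elements.
independent = span_mod_q([(1, 0, 2), (0, 1, 1)], 3)

# Here (2, 0, 1) = 2 * (1, 0, 2) mod 3, so the vectors are dependent and
# different blocks collide: the span has only 3 elements.
dependent = span_mod_q([(1, 0, 2), (2, 0, 1)], 3)
```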
It follows from the result stated that every system of (N + 1) vectors 
$a_1, a_2, \ldots, a_{N+1}$ of an N-dimensional vector space V is necessarily linearly dependent. 
In fact, if it were not so, then the set of all vectors $a_0$ such that $a_0, a_1, a_2, \ldots, a_{N+1}$ 
are linearly dependent would be an (N + 1)-dimensional linear subspace. 
However, it is clear that an N-dimensional vector space V cannot have an 
(N + 1)-dimensional linear subspace. (In particular, if the basic field F is of 
finite order, then the total number of all vectors of V is insufficient to form an 
(N + 1)-dimensional subspace.) As a specific example, let us mention a 
P(x)-arithmetic over a finite field F, where P(x) is an irreducible polynomial of order 
n. We know that this P(x)-arithmetic forms an n-dimensional vector space over 
F. It is, therefore, clear that every system of (n + 1) elements of a 
P(x)-arithmetic must be linearly dependent. 


From the concept of a vector space it is easy to pass on to the main geometric concept of 
Euclidean space. To be precise, an N-dimensional vector space E is called Euclidean, if in it is 
defined the length $|a|_E$ (or, simply, $|a|$) of a vector a with coordinates $(a_0, a_1, \ldots, a_{N-1})$: 


$$|a|_E = \sqrt{a_0^2 + a_1^2 + \cdots + a_{N-1}^2}. \qquad (+)$$


(Obviously, the basic field here must be such that there exists in it a square root of the sum 
of the squares of any pair of elements of the field.) Further, if we agree to call the vectors of 
Euclidean space ‘points’, associating with the null vector 0 some point O and with the vector 
a a point A with the same coordinates, and also to write a = OA, then the distance $|AB|_E$, 
or simply $|AB|$, between the points A and B is defined by 


$$|AB| = |OB - OA| = \sqrt{(b_0 - a_0)^2 + (b_1 - a_1)^2 + \cdots + (b_{N-1} - a_{N-1})^2}, \qquad (++)$$


where $(a_0, a_1, \ldots, a_{N-1})$ and $(b_0, b_1, \ldots, b_{N-1})$ are the coordinates of A and B (i.e., of the vectors 
OA and OB). This permits us to characterize the subject-matter of Euclidean geometry as a 
description of those properties of figures (i.e., sets of points) in the Euclidean space E that are 
identical for every pair of equal figures (where the equality of two figures is defined by the 
condition of equality of distances between any pair of points of these two figures corresponding 
to each other). 

A Euclidean space with the real coordinates of points (and vectors) is an example of a 
metric vector space. A set M of points is called a metric space if for every pair of points A 
and B there is defined a (real) number $\rho_{AB}$, called the distance between A and B, and 


MS1 : $\rho_{AB} > 0$ for A ≠ B, $\rho_{AA} = 0$ (positiveness of distance); 
MS2 : $\rho_{AB} = \rho_{BA}$ (symmetry of distance); 
MS3 : $\rho_{AB} + \rho_{BC} \geq \rho_{AC}$ for every A, B and C (triangle inequality). 


If the number $\rho_{AB} = |AB|_E$ is defined by the formula (++), then the conditions MS1 and MS2 
are obviously satisfied. It may not be that simple to establish MS3, i.e., the inequality 


$$\sqrt{(b_0 - a_0)^2 + (b_1 - a_1)^2 + \cdots + (b_{N-1} - a_{N-1})^2} + \sqrt{(c_0 - b_0)^2 + (c_1 - b_1)^2 + \cdots + (c_{N-1} - b_{N-1})^2} \geq \sqrt{(c_0 - a_0)^2 + (c_1 - a_1)^2 + \cdots + (c_{N-1} - a_{N-1})^2},$$

yet this does not present any singular difficulty.† 

There are also many other methods of introducing a ‘metric’ in an N-dimensional vector 
space. Thus, for instance, in many respects the so-called ‘Minkowski metric’†† is much simpler 
than the Euclidean metric (+)-(++). This ‘Minkowski metric’ is given by 


$$|a|_M = |a_0| + |a_1| + \cdots + |a_{N-1}| \qquad (A)$$

and 

$$|AB|_M = |b_0 - a_0| + |b_1 - a_1| + \cdots + |b_{N-1} - a_{N-1}|, \qquad (B)$$


where $|a|$ is the absolute value of a (real) number a. Equation (B) implies directly that the 
distance $\rho_{AB} = |AB|_M$ also satisfies the conditions MS1-MS3. 

The metric (A)-(B) can be defined for a vector space constructed over any basic field F for 
which there exists an absolute value of an element a in the field, a real number $|a|$ such that††† 


(i) $|a| > 0$ for a ≠ 0; $|0| = 0$; 
(ii) $|ab| = |a| \times |b|$; 
(iii) $|a + b| \leq |a| + |b|$. 


In particular, if the base field is a 2-arithmetic and the absolute value of an element in the 
field is defined by the usual equalities 


$$|0| = 0, \quad |1| = 1$$


(where 0 and 1 on the right-hand sides again occur as real numbers), then the metric defined 
by equations (A) and (B) above is called the ‘Hamming metric’: 


$$|a|_H = |a_0| + |a_1| + \cdots + |a_{N-1}|,$$


$$|AB|_H = |b_0 - a_0| + |b_1 - a_1| + \cdots + |b_{N-1} - a_{N-1}|.$$


It is plain that if the points $A = (a_0, a_1, \ldots, a_{N-1})$ and $B = (b_0, b_1, \ldots, b_{N-1})$ of an 
N-dimensional space with coordinates in 2-arithmetic correspond to two sequences of signals, 
then the distance $|AB|_H$ is equal to the number of noncoincident signals in the sequences A 


†See, for example, Kuiper, N. H. (1963), Linear Algebra and Geometry (pp. 131-132), North 
Holland, Amsterdam; or Halmos, P. R. (1958), Finite-Dimensional Vector Spaces, § 64, Van 
Nostrand, Princeton. 
††H. Minkowski, the German mathematician, in his researches on number theory, has 
considered more general methods of introducing a metric in an N-dimensional vector space, 
encompassing both the formulae (++) and (B). 
†††The symbols 0 appearing here on the left-hand and right-hand sides of the equality 
|0| = 0 have somewhat different senses: the one on the left-hand side is an element of the field 
under consideration, the other on the right-hand side is simply a real number. A similar remark 
can be made in connection with certain other equalities below. 
and B. This fact explains the usefulness of the Hamming metric in coding theory†. In addition, 
from the triangle inequality it follows that a pair of ‘Hamming spheres’ of radius n with 
centres $Q_1$ and $Q_2$ (i.e., the sets of points A such that $|Q_1 A|_H \leq n$ and $|Q_2 A|_H \leq n$, 
respectively; see p. 338) cannot intersect if $|Q_1 Q_2|_H > 2n$ (we availed ourselves of this fact on p. 338). 
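
A small numerical illustration of the last two remarks may be useful. The sketch below counts noncoincident signals, checks the triangle inequality MS3 on all binary sequences of length 3, and compares Hamming spheres of radius 1. The function names and the sample points are illustrative assumptions, and the sphere is taken here as the set of points at distance not exceeding the radius.

```python
from itertools import product

def hamming(a, b):
    """Hamming distance: the number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def sphere(center, radius):
    """All binary sequences whose Hamming distance from `center` is at most `radius`."""
    return {a for a in product((0, 1), repeat=len(center))
            if hamming(center, a) <= radius}

# MS3 checked exhaustively on all triples of binary sequences of length 3.
points = list(product((0, 1), repeat=3))
triangle_ok = all(hamming(a, b) + hamming(b, c) >= hamming(a, c)
                  for a in points for b in points for c in points)

q1 = (0, 0, 0, 0, 0)
q2 = (1, 1, 1, 0, 0)   # distance 3 > 2 * 1 from q1
q3 = (1, 1, 0, 0, 0)   # distance 2 = 2 * 1 from q1

far_apart_disjoint = sphere(q1, 1).isdisjoint(sphere(q2, 1))
close_disjoint = sphere(q1, 1).isdisjoint(sphere(q3, 1))
```

With these points the spheres around `q1` and `q2` have no common point, whereas those around `q1` and `q3` share, for example, the point (1, 0, 0, 0, 0).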

We further note that if sequences $A = (a_0, a_1, \ldots, a_{N-1})$, where all $a_i$ take the values 0 and 
1, are represented by the points of an ordinary (real) N-dimensional space (these points are 
the vertices of a ‘unit cube’ of the N-dimensional Euclidean space), then obviously 


$$|AB|_E = \sqrt{|AB|_H}.$$
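
This relation holds because each coordinate difference of two 0/1 sequences is 0 or ±1, so its square equals 1 exactly where the sequences differ. A minimal check over all pairs of vertices of the four-dimensional unit cube (the dimension 4 and the function names are arbitrary choices made for this illustration):

```python
import math
from itertools import product

def hamming(a, b):
    """Number of noncoincident coordinates."""
    return sum(x != y for x, y in zip(a, b))

def euclid(a, b):
    """Euclidean distance by formula (++)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vertices = list(product((0, 1), repeat=4))
relation_holds = all(
    math.isclose(euclid(a, b), math.sqrt(hamming(a, b)))
    for a in vertices for b in vertices
)
```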


Hence the Euclidean distance $|AB|_E$ between the points A and B, defined by formula (++), 
can serve as a completely satisfactory characteristic of the difference between the sequences 
$A = (a_0, a_1, \ldots, a_{N-1})$ and $B = (b_0, b_1, \ldots, b_{N-1})$ of the elementary signals. This position 
enables us to use in communication theory the results related to (N-dimensional) Euclidean 
geometry. In the first place, the conclusions from the so-called discrete geometry are here 
useful, since discrete geometry deals especially with the problems of ‘closest packing of disjoint 
equal spheres’ in a many-dimensional space and the problems of determining those configurations 
of a finite number of points located in a given domain of a space, for which the least 
pairwise distance between these points is the greatest. 

In particular, the problem of determining all binary codes, where the coded messages are 
sequences of N elementary signals, correcting any number of errors not exceeding n, reduces to 
the problem of determining all possible fillings of a ‘unit cube’ of an N-dimensional Euclidean 
space with disjoint spheres of radius $\sqrt{n}$ and centres at the vertices of the cube. By what has 
been stated, the problem of finding such fillings of an N-dimensional cube with spheres of a 
given radius, where the number of spheres involved is the largest possible (or, is at least 
sufficiently large), is of great interest in coding theory. However, any perspective geometric 
approach to the solution of this problem is unfortunately still an open problem. 


4. In linear algebra, an important role is played by the operation of matrix 
multiplication, a special case of which is the multiplication of an (m × n)-matrix 
by an (n × 1)-matrix (by the column vector): 


†In the case in which a basic field F contains more than two elements, the Hamming metric 
is defined by the same equations (A) and (B) as above, with the difference however that in the 
present case it is necessary to set 


$$|a| = \begin{cases} 0, & \text{if } a = 0,\\ 1, & \text{if } a \neq 0. \end{cases}$$


Here the Hamming distance $|AB|_H$ is as before equal to the number of noncoincident signals 
in the sequences A and B. 

We also remark that in coding theory, besides the ‘Hamming distance,’ some other metrics 
in a space of sequences of signals are also used. As an example, we may mention the so-called 
‘Lee metric’, which coincides with the ‘Hamming metric’ in the case of a field F of two 
elements; however, in other cases it takes note not only of the fact that some coordinates of 
the points A and B do not coincide, but also of how greatly these coordinates differ from each 
other (see, e.g., [190], Chap. 8.2). 
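$$\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n}\\
b_{21} & b_{22} & \cdots & b_{2n}\\
\vdots & \vdots & & \vdots\\
b_{m1} & b_{m2} & \cdots & b_{mn}
\end{pmatrix}
\begin{pmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{pmatrix}
=
\begin{pmatrix}
b_{11} a_1 + b_{12} a_2 + \cdots + b_{1n} a_n\\
b_{21} a_1 + b_{22} a_2 + \cdots + b_{2n} a_n\\
\cdots\cdots\cdots\cdots\cdots\cdots\cdots\\
b_{m1} a_1 + b_{m2} a_2 + \cdots + b_{mn} a_n
\end{pmatrix}.$$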


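Obviously, in this product we can also write a vector a with coordinates 
$a_1, a_2, \ldots, a_n$ in the form of a row vector: $a = (a_1, a_2, \ldots, a_n)$, although this 
is not conventional in linear algebra. It is then possible to let relations (6) on 
p. 385 assume the form 

$$\begin{pmatrix}
b_{11} & b_{12} & \cdots & b_{1n}\\
b_{21} & b_{22} & \cdots & b_{2n}\\
\vdots & \vdots & & \vdots\\
b_{m1} & b_{m2} & \cdots & b_{mn}
\end{pmatrix}
\begin{pmatrix} a_1\\ a_2\\ \vdots\\ a_n \end{pmatrix} = 0,$$

where 0 is the null column vector (a column of zeros). 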
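
The product of a matrix by a column vector, and the matrix form of relations (6), can be sketched in a few lines. The function `mat_vec`, its optional modulus argument and the sample matrix are assumptions introduced only for this illustration; with `p=2` the arithmetic is that of a 2-arithmetic.

```python
def mat_vec(B, a, p=None):
    """Product of an (m x n)-matrix B by a column vector a of length n.

    Each entry is b_i1*a_1 + ... + b_in*a_n; if p is given, the result is
    reduced modulo p (p assumed prime).
    """
    result = [sum(b * x for b, x in zip(row, a)) for row in B]
    return [v % p for v in result] if p is not None else result

B = [[1, 1, 0, 1],
     [0, 1, 1, 1]]

plain = mat_vec(B, (1, 0, 1, 1))           # ordinary integer arithmetic
in_w = mat_vec(B, (1, 0, 1, 1), p=2)       # zero column: the vector satisfies (6)
not_in_w = mat_vec(B, (1, 1, 0, 0), p=2)   # nonzero column: it does not
```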

For certain branches of linear algebra, we also find essential the concept of 
elementary transformation on a matrix, by which we understand here the follow- 
ing transformations: 


(i) the interchange of any two rows of a matrix; 
(ii) the interchange of any two columns of a matrix; 
(iii) the replacement of any row of a matrix by its sum with any other row 
(where the sum is understood as the row vector sum). 


The matrices obtained from each other by a finite sequence of elementary 
transformations are called equivalent. 

The indicated elementary transformations† are intrinsic especially to the 
parity-check matrix of a code. In fact, the interchange of matrix columns and 
rows here reduces to the renumbering of signals and checks used, respectively. 
However, the replacement of some row by its sum with another row implies 
that in place of two parity checks we check the parity of the first of the two 
used expressions and the sum of this expression with the second one. It is 
obvious that two such checks are completely equivalent to the original checks. It 
is also easy to establish further that by a sequence of elementary transformations 
each check matrix can be reduced to the form (2) on p. 318 (or equivalently, 
to the form differing from (2) only in that it is augmented by some 


†In different problems of linear algebra, different elementary transformations are, in fact, 
found suitable. 

additional rows made up of zeros; these rows obviously do not correspond to 
any new check and hence can be ignored). In fact, a zero row is of no interest; 
if such a row happens to exist already in the matrix, we can make it the top- 
most by transformation (i) and act analogously even in a case in which, in the 
process of transformations on a matrix, a new ‘zero’ row makes its appearance. 
We now consider the lowest row. It is clear that the element 1 appearing in 
it can be transferred by means of operation (ii) to the extreme right of the 
column. Thus adding this row to all rows in the last column of which 1 occurs 
and noting that in 2-arithmetic 1 + 1 = 0, we can convert into zero all ele- 
ments in the last columns, except for only a single 1 occurring in the last row. 
If after this the second row from below is found to consist of only zeros, we 
shift it upward. However, if it also contains at least a 1, then by operation (ii) 
we transfer it to the next to last column, and then by operation (iii) convert 
into zero all other elements in the next to last column. We next pass on to the 
third from last row and by iterating the same operations we endow the third 
from last column with the desired form, and so on. As a result, we obtain a 
matrix of form (2), possibly supplemented from above by rows that consist 
only of zeros. 
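
The reduction just described can be sketched as a short program. This is only an illustration under stated assumptions: form (2) of p. 318 is not reproduced in this appendix, so the sketch simply follows the steps in the text (zero rows shifted to the top, a 1 of the current row brought to the current column by transformation (ii), the rest of that column cleared by transformation (iii)) and produces a unit block in the rightmost columns of the nonzero rows. The function name and the sample matrix are hypothetical.

```python
def reduce_check_matrix(H):
    """Reduce a 0/1 matrix by elementary transformations (i)-(iii) in 2-arithmetic."""
    M = [row[:] for row in H]              # work on a copy
    n = len(M[0])
    zero_rows = 0                          # zero rows already shifted to the top
    row, col = len(M) - 1, n - 1           # current row (from below) and its column
    while row >= zero_rows:
        if not any(M[row]):
            M.insert(0, M.pop(row))        # (i): move the zero row to the top
            zero_rows += 1
            continue
        # All columns to the right of `col` are already zero in this row,
        # so the rightmost 1 lies at or to the left of `col`.
        j = max(k for k in range(n) if M[row][k])
        for r in M:                        # (ii): interchange columns j and col
            r[j], r[col] = r[col], r[j]
        for i, r in enumerate(M):          # (iii): add the row wherever column col has a 1
            if i != row and r[col]:
                M[i] = [(x + y) % 2 for x, y in zip(r, M[row])]
        row, col = row - 1, col - 1
    return M

# Hypothetical check matrix whose third row is the sum of the first two.
H = [[1, 1, 0, 1],
     [0, 1, 1, 1],
     [1, 0, 1, 0]]
R = reduce_check_matrix(H)
```

For this matrix the result has one zero row on top, and the two remaining rows end in a 2 × 2 unit block, so one of the three original checks turns out to be superfluous. Note that the interchanges of columns renumber the signals, so the reduced matrix refers to the renumbered signals.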

The application of this result to the parity-check matrix of a code 
demonstrates that every parity-check code can be written in the form of a systematic 
code, the number of parity checks in which may, however, be less than that 
in the original ‘non-systematic’ code (see p. 319 and Example on p. 336). 
Appendix IV 


SHORT TABLE OF THE FUNCTION h(p) = -p log p - (1 - p) log (1 - p) 


p h(p) p h(p) 
0.005 0.045415 0.130 0.557438 
0.010 0.080793 0.135 0.570993 
0.015 0.112364 0.140 0.584239 
0.020 0.141441 0.145 0.597185 
0.025 0.168661 0.150 0.609840 
0.030 0.194392 0.155 0.622213 
0.035 0.218878 0.160 0.634310 
0.040 0.242292 0.165 0.646138 
0.045 0.264765 0.170 0.657705 
0.050 0.286397 0.175 0.669016 
0.055 0.307268 0.180 0.680077 
0.060 0.327445 0.185 0.690894 
0.065 0.346981 0.190 0.701471 
0.070 0.365924 0.195 0.711815 
0.075 0.384312 0.200 0.721928 
0.080 0.402179 0.205 0.731816 
0.085 0.419556 0.210 0.741483 
0.090 0.436470 0.215 0.750932 
0.095 0.452943 0.220 0.760167 
0.100 0.468996 0.225 0.769193 
0.105 0.484648 0.230 0.778011 
0.110 0.499916 0.235 0.786626 
0.115 0.514816 0.240 0.795040 
0.120 0.529361 0.245 0.803257 
0.125 0.543564 0.250 0.811278 
0.255 0.819107 0.380 0.958042 
0.260 0.826746 0.385 0.961497 
0.265 0.834198 0.390 0.964800 
0.270 0.841465 0.395 0.967951 
0.275 0.848548 0.400 0.970951 
0.280 0.855441 0.405 0.973800 
0.285 0.862175 0.410 0.976550 
0.290 0.868721 0.415 0.979051 
0.295 0.875093 0.420 0.981454 
0.300 0.881291 0.425 0.983708 
0.305 0.887317 0.430 0.985815 
0.310 0.893178 0.435 0.987775 
0.315 0.898861 0.440 0.989588 
0.320 0.904381 0.445 0.991254 
0.325 0.909736 0.450 0.992774 
0.330 0.914925 0.455 0.994149 
0.335 0.919953 0.460 0.995378 
0.340 0.924819 0.465 0.996462 
0.345 0.929523 0.470 0.997402 
0.350 0.934068 0.475 0.998196 
0.355 0.938454 0.480 0.998846 
0.360 0.942683 0.485 0.999351 
0.365 0.946755 0.490 0.999711 
0.370 0.950672 0.495 0.999928 
0.375 0.954434 0.500 1.000000 
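
Entries of this table can be recomputed directly from the definition. The sketch below assumes that log denotes the logarithm to base 2, which is consistent with the tabulated values (for instance h(0.250) = 0.811278); the function name `h` mirrors the table heading, and the convention h(0) = h(1) = 0 is the usual limiting value.

```python
import math

def h(p):
    """h(p) = -p log p - (1 - p) log (1 - p), with logarithms to base 2."""
    if p in (0.0, 1.0):
        return 0.0                  # limiting value, since x log x -> 0 as x -> 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# A few rows in the layout of the table (p, h(p) to six decimal places).
rows = [(p, round(h(p), 6)) for p in (0.100, 0.250, 0.500)]
```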
length 144 
average 151, 156-57 
Coding 122, 138, 249, 251-53, 255 
and statistical laws 148 
advantageous (efficient) 139 
algebraic theory of 317(fn), 318 
block 146, 152-53, 157, 289 
fundamental theorem of xvi, 147f., 157-63, 
172-73, 246, 274 
fundamental theorem of noisy (due to 
Shannon) xvi, 274-75, 289-306 
converse of xvi, 273, 283-90 
strongly converse of 290 
Huffman method of 155-57 
method 293, 307 
random 292-97, 301(fn), 306, 323 
Shannon-Fano method of 152-53, 155-56, 
158 
theory xvi, 305-07 
notion of entropy in 161 


Codon 255-56, 258 


‘Comma’, as separating symbol 140 
Communication, specific content of 55 
Communication channel xii, xiii 
and statistical regularities 55 
associated with hereditary phenomenon 252 
binary 
asymmetric 273 
symmetric 263, 265, 275, 277, 280, 299, 
301 (fn), 302, 307-308, 311, 323, 342 
erasure 267, 269-70, 323 
capacity xiii, 246-51 
in absence of noise 173, 262 
in presence of noise 262-63, 272, 297 
zero-error 283 
human organism as 249-51 
m-ary symmetric 266 
new forms of 249 
noisy 252, 260-62 
code selection for every 274 
mathematical model of 260-61 
non-binary 317(fn), 346 
transmission of speech over 219 
Complement of a set 39 
Constant 
mean value of a 28 
number p | 
Convex k-gon 354 
Coset 368 
Counterfeit coin problems viii, 108-21, 136, 
147 
Current pulse 137, 139, 246, 251 
Cytoplasm 252-53 
Cytosine 253, 255 
Czech language 
entropy of 214 


Decimal number system 143 
Decimal system of logarithms xiii, xv 
Decimal unit (dit) xv, 46 
Decoding 140-41, 249, 251-53, 255 
Decoding error probability 293, 296 
mean 339-40, 342 
Hamming lower bound on 340 
Decoding, instantaneous 141 
Decoding method 293, 299, 306-07 
Decoding rule 322 
Decoding, sequential 251, 323-24 
Decoding, unique 140 
Deoxyribonucleic acid (DNA) 253-54 
molecules, four-letter alphabet 253 
Die 
imperfect 42 
throw 1-2 
Disjoint equal spheres, closest packing of 389 
Dispersion (or spread) 27-34, 36 (see also 
Variance; Variance of random variables) 
Distance 387 
code 338 
Euclidean 389 
geometric 338 
Hamming 338, 340, 342, 389(fn) 
utility in coding theory 389 
Lee 389 
Minkowski 388 
Distributive law 38 
Dit xv, 46 
Divisor of numbers, greatest common 40, 
365, 375 
Dravidian languages, entropy of 197, 214 


Ear, resolving power of 247 
Element(s) 
difference of 365 
identity 364 
inverse of an 365 
symmetric 364-65 
unit 365, 371 


English alphabets 139, 192, 203 
English language 


average information in stressed words of 
217 

average word-length in 181, 188 

coding text-letters in, Shannon-Fano 
method 180 

digram frequency in 182-83, 186 

entropy of letters/words in 179-81, 187, 194 
estimation by Cover and King 202-03 

first-order approximation to 181, 186 

letter/word frequency in 178-79, 182, 186 

letter-guessing experiments for 191, 196 

redundancy for 185, 188, 191, 195-96 

second-order approximation to 183, 186-87 

statistical characteristics of letters/words 
186-88 

third-order approximation to 183 

trigram frequency in 183, 186 

zero-order approximation to 178 


English speech 


one phoneme redundancy for 220 


Entropy(ies) viii, xvii, 44ff., 47, 53-54, 74, 88, 


93f. 
addition law of 60, 63, 98 
as measure of uncertainty 44ff., 47, 56, 94 
combinatorial 215 
concept of xii, 55-56 
conditional 62-63, 181-82, 184, 188-90, 220 
ε- 82, 230 
Hartley’s viewpoint of 53-56, 59, 125 
limit 185, 188-89, 207 
of compound experiment 61 
of experiment 47 
of language 177f. (see also under Various 
languages) 
method of determining 198 
of N-letter block 162 
of one letter 149 
of speech 215f. 
of television images 231 
residual 124-25 
Shannon’s approach to 53-57, 215 
specific 162, 171, 184, 198, 219 
thermodynamic concept of 47(fn.f) 
true, upper bound on 199 
unconditional 7} 
Enzymes 253 
Equiprobability 3-4, 44-45 
Error(s) 
single 309, 311 
systematic 26, 32 
Error probability 
exponential bound of 281(fn) 
mean value of 284-86 
Event(s) 
addition of 7f., 36 
certain (sure) 8 
compatible 9 
contrary 8 
incompatible 7f. 
pairwise 9, 23, 45, 59 
impossible 8, 35 
independent 7f., 10-11, 20, 25 
multiplication of 7f., 36 
associative law of 38, 91 
obeying statistical law 82 
product of 10, 36 
random 1f., 4, 40, 44 
set of elementary 43 
sum of 8-9, 14, 36 
Euclidean 
distance 389 
division of algorithm 376(fn) 
ring 376(fn) 
space 387 
European languages 
entropy values for 198 
letter frequencies in 197 
redundancy estimates for 185, 196 
spoken, phoneme statistics and entropies 
219 
Experiment(s) 1, 121-22 (see also under Letter- 
guessing; Urn) 
auxiliary 1, 122 
compound 42, 45, 55, 88-89, 123, 125, 232 
dependent 61 
equally probable outcomes of an 42-43, 45 
independent 45, 59 
simple 88 
weather 52 
Exponential mean 349-50 
Eye, resolving power of 231, 247 


Fano codes (see under Codes) 
Fano inequality 287-88, 303-04 
Field 369-71, 379 

algebraic, two-element 328 
basic (or field of scalars) 382 
binary 317 

finite 317(fn) 

Galois’ 380(fn) 

order of 371 

primitive element of 372 
properties of 371 


N 
Formula for the number (x) 14-15 
French language 
digram probabilities in 194 
entropy of 195 
first-order approximation to 193 
letter frequency in 192-93 
letter-guessing experiments for 196 
phoneme statistics and digrams of 219 
redundancy estimates for 195 
trigram probabilities in 194 
word-length in 192 
Frequencies, limiting, of guessing a letter 189 
Frequency 
dictionary 186, 207, 217 
of occurrence of result 1 
Function(s) 
convex 287, 304, 347ff. 
graph of 351 
properties of 347ff. 
test for 346(fn) 
exponential 281(fn), 347 
logarithmic 347 


Gambling scheme, due Cover and King 201- 
03 

Game of ‘garbled telephone’ 89 

Gene, ‘initiation’ and ‘termination’ marks 
256(fn), 257 

Genetic information transmission (see under 
Information) 

Geometric distance 338 

Geometric mean 349 
theorem 304, 355 

Geometry 
discrete 389 
Euclidean 389 

German language 
‘first-order approximation’ to 193 
letter frequency in 192-93 
letter-guessing experiments for 195-96 
redundancy for 195 
‘semantic’ information in, study by 
Küpfmüller 218 
German speech 
redundancy of one phoneme 220 
Golay perfect binary code 341 
Greatest common divisor 40, 365, 375 
Group 320, 364-65 
commutative 364 
cyclic 369 
multiplicative 365 
non-commutative 364(fn) 
null element of 365 
order of 369 
subgroup of a 320, 368 
Guanine 253, 255 
Guessing method viii, 148-49, 181, 188-92, 
195-98, 204, 206, 211, 225 
due Shannon 188-92, 195-98, 201-02, 212, 
236-37 
sharpening by Kolmogorov 198-99, 212, 
214 


Hamming 
distance 338, 340, 342, 389(fn) 
inequality 315, 328, 335-37, 339, 341 
lower bound 315-16, 336-37, 339 
metric 388-89 
utility in coding theory 389 
sphere 338, 340-42, 389 
upper bound 339 
Hebrew language 
redundancy for 197 
statistical structure of 197 
Hieroglyphic writing 207-08, 231 
Hungarian 
alphabets 203 
language, entropy of 203 
prose, information-theoretic characteris- 
tics of 214 
redundancy estimates for 203 


Ideal 332, 376 
principal 377, 380 

Images 
colour television 237-38 
detail-starved 235 
heterochromatic 235 
monochromatic 232, 235 
television 288f. 

Indian languages 197-98, 214 
redundancy estimates for 197-98 

Inequality 
Chebyshev 26f., 33-34, 299, 306 
Fano 287-88, 303-04 
Hamming 315, 328, 335-37, 339, 341 
Jensen 352, 356 
Kraft 175 
McMillan 175 
Varshamov-Gilbert 316, 327-28, 335-37, 
345 
Information 
amount of xiii, 74, 215 
algorithmic approach to xvii 
combinatorial approach to xvii, 215 
average amount of 
in an experiment 74-75, 132 
in a text-letter 203 
in a word 207 
concept of 7, 73f., 80, 90, 137 
conditional 91-92 
entropy and 44ff., 74 
genetic, and its transmission 251-58 
limiting 214 
mean 74-75 
conditional 91 
of various messages encountered in 
practice 177ff. 
as continuously varying messages (tele- 
vision images) 228f. 
as musical notes 222f. 
as phototelegrams 238f. 
as spoken language 215f. 
as written language 177f. 
reciprocal, of two events 86 
semantic xvii, 216, 218-19, 221(fn), 228-29 
specific 205, 207 
symmetry property of 91, 93 
total 205-06, 221 
triple, equation 92-93, 299 
unsemantic 218-19, 229, 231, 250 
Information theory vii-viii, xi-xiii, xv-xvii, 
xix 
applications to information transmission 
through communication channels viii, 
137ff. 
Information transmission 
error-free 282, 283(fn) 
over noisy channels 258-304 
rate 173, 216(fn), 304-05 
largest 339 
sequential 89 
Input signals 251, 254-55 
Insistence stresses 217 
Inverse of an element 365 
Italian language 
entropy of 209-10 
letter frequency in 193-94 


Jensen's inequality 352 
general 356 


k-arithmetic 368 

Kazakhian language, letter-guessing experi- 
ments for 196 

Kraft’s inequality 175 


Lagrange’s theorem 369 
Language (see under Entropy; Redundancy; 
also Various languages) 
control tower 212 
intonation 218 
specialized 210, 212 
spoken xiii, 215f. 
statistical structure of 188 
written xiii, 177f. 
Law of 
contradiction 41 
excluded middle 41 
large numbers xvi, 26f., 36, 170 
Least common multiple 40 
Lee distance (metric) 389 
Letter(s), average frequency of 178 
Letter-guessing experiment (see also Guess- 
ing method) 
by Piotrovskii et al. 211 
by Weltner 211 
technique 
Kolmogorov’s postulate on 201-02 
refinement by Cover and King 201 
Linear 
code-word collection 320 
subspace 320-21 
Logarithm 
decimal 46 
factor of transition log,a 45 
Logic, mathematical 41-42 
Logical 
problems 100-08 (see also Recreative problems) 
on geometric probability 42 
propositions 41-42 
Luminosity scales (see Brightness levels) 


m-ary number system 143, 147 
Mathematical logic 41-42 
Matrix (Matrices) 318, 366 
additive group of 366 
check 321 
elementary transformations on 390 
equivalent 390 
multiplication 389 
McMillan inequality 175 
Mean 6, 26 
Mean value of 
square of deviation 27 
weighings 133, 136 
Melody (Melodies) 218 
computer-created 226-28 
new, by urn experiments 225 
redundancy for 223 
Stephen Foster’s 224, 228 
Mesh of an element block 243-44 
value calculation by Frolushkin 243 
Message(s) xiii, 136, 251 (see also under In- 
formation) 
continuously varying, television images 
228-38 
entropy of 236-37 
redundancy for 236-37 
discrete, optimal coding/decoding of 252 
input 253-55 
musical (see under Musical) 
output 253-54 
phototelegraphic xiii 
entropy of 238-39, 242 
redundancy for 242 
screen elements of 238-39, 241 
specific entropy of 243 
Statistical laws of xiii, 149 
Message quantization 230 
probable sequences of 171 
Message transmission 
error-free 273, 283 
mean error frequency in 284(fn) 
rate 246-50, 273 
Metric 
Hamming 388-89 
utility in coding theory 389 
Lee 389 
Minkowski 388 
Metric space 387 
Minkowski metric 388 
Mitosis 253 
Monte-Carlo methods 226 




Morpheme 208 

Morse code 138, 140 

Music 222f. 
basic elements 222-23 
chromatic scale 222-23 

Musical compositions 
cowboy songs, American 223, 226 
entropy of, maximum possible 222 
German romantic 224 
Haydn’s 224 
hymns 223-24, 228 
information-theoretic characteristics 224 
note-guessing method due Pinkerton 225 
nursery-rhymes, American 222 
redundancy of 222, 224-26 
rock and roll, American 224 
Schonberg’s 224 

Musical messages 222 

Musical sentences 225 

Mutation 253 


Nerve cells (neurons) 252 
Nerve fibres 
aural 250 
optic 250, 252 
Non-parity 314 
Number(s) 366 
absolute value (norm) of 27, 41-42 
algebra of 37 
comparison of, basic rules 39 
digits 143-44 
greatest common divisor of 40, 365, 375 
idempotent 37 
law of large xvi, 26f., 36, 170 
least common multiple of 40 
Number system 
binary 143 
centenary 145 
decimal 143 
m-ary 143 
ternary 147 


Order of 
field 371 
group 369 
ring 377-78 
Oriental languages 197 
Outcomes 4 
equally probable 3-4, 130-31 
complete system of 42-43 
impossible 55 
low-probability 59 

mutually exclusive 9 

not equally probable 131 

proof of feasibility/infeasibility of 128 
Output signal 254-55 


Parity 314 
Parity check 310, 312-13, 315, 318 
matrix of a code 318 
Pause 137 
Phenylalanin 257 
Phoneme(s) 208, 219, 229 
alphabet 219 
average length of 221-22 
spectrogram 220 
Phototelegrams 238f. 
Message transmission by 238-5] 
screen elements employed in 238 
Phrase 208 
Pitch of tone 229 
Polish language, letter guessing experiments 
for 196 
Polynomial(s) 367 
as a group 367 
check 341 
code generator 329, 342-43 
composite 379 
cyclotomic 331 (fn) 
irreducible 379 
minimal 343 
reducible 381 
roots of 342 
Portuguese language 
entropy of 193, 210 
letter frequency in 193 
Probabilistic choices reduced to binary choic- 
es 225 
Probability(ies) 1f. 
addition law of 9, 87 
basic 42 
classical concept of viii, 42-43 
conditional 20f., 61 
definition 1f., 42 
mean error 283-84, 286-87, 288(fn) 
multiplication law of 11, 22, 25 
properties of 7f. 
table 4 
total, equation of 23 
Probability theory viii, 2, 42-43 
an approach first indicated by Bernstein 42 
axiomatic construction due Kolmogorov 43 
definition, classical, introduced by Laplace 
43 
link with Boolean algebra 42 
main problem of 42 
Problems 
counterfeit coin 108-21 
logical 100-08 
on geometric probability 42 
Protein(s) 253-54 
synthesis 254 
twenty-letter language 255 


q-arithmetic 365-66, 368
Quantization of messages 230 
Question(s) 122 

cost element of 122 

mean value of 130, 134 
Questionnaire 122 
Q(x)-arithmetic 378 


Random event 1f.

Random variable(s) 1f., 170, 293
arithmetic mean of 31-32, 34-36 
independent 16-17, 29 
mean value of 6, 26, 54 

any 296-97 

product of 19 

sum of 16, 18, 29, 31 
mutually independent 19, 31, 36 
pairwise independent 36 
product of 15, 18-19 
standard deviation of 27 
sum of 15-16, 18, 29-31 
variance of 26f. 

square root of 27 

Reaction 
complex 57 
error-free 83 
of choice 57 
psychic 56 
simple 57 
time, mean 57-58, 72-73, 83-85 

Reciprocal information of two events 86 

Recreative problems viii, xii, xix, 101, 106, 
108, 137 

Redundancy, zero 188, 213 

Redundancy estimates for 
a language 185, 188, 190-91 (see also under various languages)
business texts 211 
literary texts 210-15
melody 223 
speech 216 
relationship with that of written language 221
television images 231-37 
typed text 240-41, 244 
Ribonucleic acid (RNA) 254 
messenger (mRNA) 254-55
molecule, information transmission through 
254-55 
Ribosomes 253-54 
Ring 369, 374 
commutative 369, 374(fn) 
Euclidean 376(fn) 
field as a 375 
order of 377-78 
Rumanian language 
entropy estimates for 209-10, 214 
letter-guessing experiments for 196 
Rumanian speech, low-order entropy esti- 
mates for 220-21 
Russian alphabets 139, 194 
Russian language 
letter-guessing experiments for 196 
redundancy estimates for 196-201 
Russian literary texts, guessing method for 
199-201 
Russian speech, entropy of phonemes in 220 
Russian telegraphic text 212 
Russian tetrametric iambic verse 213 


Samoan language, entropy and redundancy
estimates for 196 
‘Saturation’ of a block of elements 243-44 
calculation of values of, by Frolushkin 243 
Semantic information xvii, 216, 218-19, 
221(fn), 228-29 
Sense organs, information receiving capacity 
of 250-51 
Set(s) 
comparison of 39 
complement of a 39-40
empty 38 
intersection of 38, 40 
ordering of 40 
product of 38 
sum of 37 
super- 43 
union of 37, 40, 90 
Shannon’s approach (see under Entropy)
Shannon’s coding theorem (see under Coding) 
Signal(s) 137, 251 
binary 149, 157 
check 310, 312-13, 315, 336, 345 
distortion of 247-48, 258-59 
elementary 138-39, 146-48, 151, 155, 251, 
254, 285(fn) 
average number of 142, 147-48, 155-56 
error-free transmission of 265 
information 310, 312-13, 336, 345 
maximal level of 248 
probabilistic average value of 148 
Sorting of objects 122 
Space 
Euclidean 387 
metric 387
subspace, linear (or vector) 320-21, 384-85 
vector 319, 381 
dimension of 382 
Spanish language 
‘first-order approximation’ to 193 
letter frequency in 193 
Speech melodies (see under Melody) 
Sphere 
disjoint, equal, closest packing of 389 
Hamming 338, 340-42, 389 
Statistical regularities, role of, in communi- 
cation lines 55 
Subgroup 
cyclic 369 
index of 369 
Subspace linear (or vector) 320-21, 384-85 
Swedish language, redundancy estimates for 
195 
Syllable 208 
Systematic errors 26, 32 


Tamil alphabets 197-98 

Television images xiii (see under Entropy, 
also Redundancy) 

Ternary number system 147 

Tetrahedron, throw of 25, 90, 93 

Text words, statistical relationship between 
207 

Thesaurus xvii

Thorndike dictionary 59 

Thymine 253-55 

Total probability equation 360 
Transformations, elementary (see under
Matrix) 

Transmission band width 247 

Trials 1
mutually independent 31 

Triangle inequality 387 


Uncertainty 
degree of 44-45, 56-57 
entropy as a measure of the amount of 44f. 
mean value of 54 
measurement (see also under Entropy) 
Hartley’s viewpoint 53, 125 
in binary unit (bit) 45-46 
in decimal unit (dit) xv, 46 
Shannon’s approach to 53 
Unit(s) 
binary (bit) 45-46 
cube 389 
decimal (dit) xv, 46 
Upper bound 
Hamming 339 
Varshamov-Gilbert 316, 336-37 
Uracil 254-55, 257 
Urn problem(s) xii, 2-3, 42, 51 


Variables, random (see Random variables) 
Variance 27-34, 36
Varshamov-Gilbert
inequality 316, 327-28, 335-37, 345 
upper bound 316, 336-37 
Vector(s) 
coordinates of a 342 
linearly dependent 386 
linearly independent 386 
null 366 
Vector-column 366 
Vector-row 366 
Vector space 319, 381 
dimension of 382 


Weather 
experiments 52 
forecast 53, 77 

Words 
as blocks 208 
space between 203 


Zipf principle 209 


